An Introduction to Time Series with Team Apache
TRANSCRIPT
Patrick McFadin (@PatrickMcFadin)
Chief Evangelist for Apache Cassandra, DataStax
Process, store, and analyze like a boss with Team Apache: Kafka, Spark, and Cassandra
Agenda
Section 1
• Lecture
  • Kafka
  • Spark
  • Cassandra
• Hands on
  • Verify Cassandra is up and running
  • Load data into Cassandra
• Break 3:00 - 3:30
Section 2
• Lecture
  • Cassandra (continued)
  • Spark and Cassandra
  • PySpark
• Hands On
  • Spark Shell
  • Spark SQL
About me
• Chief Evangelist for Apache Cassandra
• Senior Solution Architect at DataStax
• Chief Architect, Hobsons
• Web applications and performance since 1996
What is time series data?
A sequence of data points, typically consisting of successive measurements made over a time interval.
Source: https://en.wikipedia.org/wiki/Time_series
Data Gnomes (after the Underpants Gnomes)
• Step 1: Collect Data
• Step 2: ?
• Step 3: Profit!
What is time series analysis?
Methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data.
Source: https://en.wikipedia.org/wiki/Time_series
The three Vs
• Velocity
• Volume
• Variety
Internet of Things
June 29, 2007
Bring in the team
Team Apache
• Collect: Kafka
• Organize: Akka
• Process: Spark
• Store: Cassandra
All running on Mesos, with each component scaled out across many nodes (Kafka, Akka, Spark, Cassandra).
2.1 Kafka - Architecture and Deployment
The problem
One customer asks the kitchen for "Hamburger please," another for "Meat disk on bread please." With no queue between them, the kitchen has to field every request directly, including the back-and-forth ("You mean a Hamburger?" "Uh yeah. That."). Putting an Order Queue between customers and the kitchen decouples the two: customers drop orders on the queue, and the kitchen works through them at its own pace.
Order from chaos
A Producer appends messages to a named topic (Topic = Food). Each message (Order 1, Order 2, up through Order 5) is written to the end of the topic in the order it was produced. A Consumer reads from the topic independently and sees the orders in exactly that sequence.
Scale
A single Topic = Food can be split into more specific topics (Topic = Hamburgers, Topic = Pizza), each its own ordered log of Order 1 through Order 5, so producers and consumers can scale out per topic.
Kafka
A Producer (the Collection API) writes to two topics on a Broker: Topic = Temperature (Temp 1 through Temp 5) and Topic = Precipitation (Precip 1 through Precip 5). Each topic has its own Consumer: a Temperature Processor and a Precipitation Processor.
Each topic is stored as one or more partitions (Partition 0, Partition 1, ...). Adding partitions lets a topic spread across brokers and lets multiple consumers (for example, two Temperature Processors) read in parallel, one per partition. With Replication Factor = 2 on the Temperature and Precipitation topics, every partition is copied to a second broker, so losing one broker neither loses committed messages nor stops the Temperature and Precipitation Processors, which fail over to the surviving replica.
Guarantees
Order
• Messages are ordered as they are sent by the producer
• Consumers see messages in the order they were inserted by the producer
Durability
• Messages are delivered at least once
• With a Replication Factor of N, up to N-1 server failures can be tolerated without losing committed messages
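The ordering guarantee holds per partition: a key-based partitioner sends every message with the same key to the same partition, where offsets grow monotonically. A toy Python sketch of that idea (stdlib only, hypothetical names, not the Kafka API):

```python
# Toy model of Kafka-style per-partition ordering (illustrative, stdlib only).
# Messages with the same key hash to the same partition, so a consumer
# reading that partition sees them in the order the producer sent them.
from collections import defaultdict
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Stable hash -> partition index (Kafka's default partitioner is similar in spirit)
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

log = defaultdict(list)  # partition -> ordered list of (offset, message)

def produce(key: str, value: str) -> None:
    p = partition_for(key)
    log[p].append((len(log[p]), value))  # offset grows monotonically per partition

for i in range(1, 6):
    produce("food-orders", f"Order {i}")

p = partition_for("food-orders")
print([msg for _, msg in log[p]])  # all five orders, in send order
```

Because "food-orders" always maps to the same partition, the consumer of that partition replays Order 1 through Order 5 in send order; ordering across different partitions is not guaranteed.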
3.1 Spark - Introduction to Spark
Map Reduce
Input Data -> Map -> Intermediate Data -> Reduce -> Output Data, with each stage written to disk.
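The Map -> Intermediate -> Reduce shape above, sketched as a word count in plain Python (stdlib only; in real MapReduce the intermediate data between the phases is written to disk):

```python
# Word count in the classic map -> shuffle -> reduce shape (toy sketch).
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)                 # emit intermediate (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)              # group intermediate pairs by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["spark kafka cassandra", "kafka kafka"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["kafka"])  # 3
```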
Data Science at Scale
2009
In memory
Where Map Reduce writes intermediate data to disk between every Map and Reduce step, Spark keeps it in memory: Input Data (disk) -> Spark, intermediate data in memory -> Output Data.
Resilient Distributed Dataset
RDDs
Are
• Immutable
• Partitioned
• Reusable
and have
Transformations
• Produce a new RDD
• Calls: filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
Actions
• Start cluster computing operations
• Calls: collect: Array[T], count, fold, reduce...
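The transformation/action split can be sketched with Python generators (a toy analogy, not the RDD API): building the pipeline does no work; only consuming it with an "action" does.

```python
# Sketch of transformation (lazy) vs action (eager) semantics using generators.
data = range(1, 11)

evaluated = []
def trace(x):
    evaluated.append(x)   # record when an element is actually processed
    return x * 2

pipeline = (trace(x) for x in data if x % 2 == 0)  # "transformation": nothing runs yet
assert evaluated == []                              # lazy: no work done so far

result = sum(pipeline)    # "action": forces evaluation of the whole pipeline
print(result)             # evens 2..10 doubled and summed -> 60
```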
API
map, reduce, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
• Spark Streaming: Near Real-time
• Spark SQL: Structured Data
• MLlib: Machine Learning
• GraphX: Graph Analysis
Spark Streaming
Petabytes of data
Gigabytes Per Second
3.1.1 Spark - Architecture
![Page 54: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/54.jpg)
Directed Acyclic Graph
Resilient Distributed Dataset
![Page 55: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/55.jpg)
DAG
RDD
![Page 56: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/56.jpg)
DAG
A DAG is made of stages: Stage 1 -> Stage 2 -> Stage 3 -> Stage 4 -> Stage 5.
RDD
An RDD is created from data in an input source:
• File
• Database
• Stream
• Collection
Calling an action on the RDD returns a value, e.g. .count() -> 100.
Partitions
An RDD's data is split into partitions (Partition 0 through Partition 9) distributed across servers (Server 1 through Server 5). If a server is lost, the RDD's partitions can be rebuilt on the remaining servers (Server 2 through Server 5).
Workflow
RDD: textFile("words.txt") -> countWords() -> Action
The DAG Scheduler turns this into a plan:
• Stage one - Count words (partitions P0, P1, P2)
• Stage two - Collect counts (P0)
A Master coordinates Workers. Each Worker runs on a server with local Data Storage and hosts one or more Executors.
Narrow Transformations
The DAG Scheduler runs Stage one - Count words as tasks (P0, P1, P2) on Executors across the Workers, each reading from its local Storage. Narrow transformations operate on each partition independently:
• filter
• map
• sample
• flatMap
Wide Transformations
Stage two - Collect counts (P0) needs data from every partition, so it triggers a shuffle across the Workers. Wide transformations that shuffle:
• join
• reduceByKey
• union
• groupByKey
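The narrow/wide split can be sketched with plain Python lists standing in for partitions (a toy model, not the Spark API):

```python
# Narrow vs wide transformations over hand-rolled "partitions" (toy sketch).
from collections import defaultdict

partitions = [["a", "b", "a"], ["b", "c"]]   # two partitions of a dataset

# Narrow: map/filter touch only their own partition -- no data movement.
upper = [[w.upper() for w in part] for part in partitions]

# Wide: reduceByKey must bring equal keys from all partitions together -> shuffle.
def reduce_by_key(parts):
    shuffled = defaultdict(int)
    for part in parts:                # equal keys from every partition are combined
        for w in part:
            shuffled[w] += 1
    return dict(shuffled)

print(reduce_by_key(partitions))     # {'a': 2, 'b': 2, 'c': 1}
```

The narrow step could run on each server in isolation; the wide step cannot, because the key "b" lives in both partitions, which is exactly why it forces a shuffle.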
3.2 Spark - Spark Streaming
The problem domain
Petabytes of data
Gigabytes Per Second
Input Sources
Receiver Based Approach
A Streaming receiver consumes each topic partition from the Broker (Topic = Temperature, Topic = Precipitation) and hands the data to Spark Streaming. If a receiver dies before its data is processed, that data is lost. To prevent this, received data can also be written to a Write Ahead Log before processing.
Receiver Based Approach

val kafkaStream = KafkaUtils.createStream(
  streamingContext,
  [ZK quorum],          // ZooKeeper server IP
  [consumer group id],  // consumer group created in Kafka
  [per-topic number of Kafka partitions to consume])  // list of Kafka topics and number of threads per topic
Direct Based Approach
Instead of a receiver, each Spark Streaming task reads directly from the Kafka partitions on the Broker (Topic = Temperature, Topic = Precipitation), tracking its own offsets. Streaming tasks scale out one per partition, and a failed task can re-read from its last offset.
Direct Based Approach

val directKafkaStream = KafkaUtils.createDirectStream[
  [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext,
  [map of Kafka parameters],  // list of Kafka brokers (and any other params)
  [set of topics to consume]) // Kafka topics
3.2.2 Spark - Streaming Windows and Slides
Discretized Stream
DStream
Events arrive continuously from Kafka and are grouped into a DStream: discrete by time. Each time slice of individual events is one batch, and each DStream batch = an RDD.
Every X seconds, the current DStream batch is handed to a transform, producing a new DStream:
• .countByValue
• .reduceByKey
• .join
• .map
With a timeline T0 through T11 and a 1-second window, each window of the Event DStream is transformed into one batch of the Transform DStream. The window then slides forward and the transform runs again on the next window.
Window
• Amount of time in seconds to sample data
• Larger sizes create memory pressure
Slide
• Amount of time in seconds to advance the window
DStream
• A window of data as a set
• Supports the same operations as an RDD
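Window and slide can be sketched over a list of per-second batches (plain Python, hypothetical helper name):

```python
# Sketch of windowed DStreams: window length and slide interval, in batches.
batches = [[i] for i in range(12)]        # one micro-batch per second, T0..T11

def windowed(batches, window, slide):
    # Each emitted window is the union of the last `window` batches,
    # advanced by `slide` batches each step.
    out = []
    for end in range(window, len(batches) + 1, slide):
        merged = [x for b in batches[end - window:end] for x in b]
        out.append(merged)
    return out

print(windowed(batches, window=3, slide=2))
# [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 8], [8, 9, 10]]
```

A slide smaller than the window makes consecutive windows overlap (as above); a slide equal to the window partitions the stream into disjoint chunks.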
4.1 Cassandra - Introduction
My Background
…ran into this problem
How did we get here?
1960s and 70s -> 1980s and 90s -> 2000s -> 2010
Gave it my best shot
client -> router -> shard 1 | shard 2 | shard 3 | shard 4
"Patrick, all your wildest dreams will come true."
Just add complexity!
A new plan
Dynamo Paper (2007)
How do we build a data store that is:
• Reliable
• Performant
• "Always On"
Nothing new and shiny. Evolutionary. Real. Computer Science.
Also the basis for Riak and Voldemort.
BigTable (2006)
• Richer data model
• 1 key, lots of values
• Fast sequential access
• 38 papers cited
Cassandra (2008)
• Distributed features of Dynamo
• Data model and storage from BigTable
• February 17, 2010: graduated to a top-level Apache project
Cassandra - More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server
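"More capacity? Add a server" works because data placement is decided by a token ring rather than a central router. A heavily simplified Python sketch (illustrative only; Cassandra actually uses Murmur3 tokens and vnodes):

```python
# Simplified token-ring sketch: partition key -> token -> owning node.
# (Toy model; node names and the tiny token space are made up for illustration.)
import bisect
import zlib

def token(key: str) -> int:
    return zlib.crc32(key.encode()) % 100          # toy token space 0..99

ring = sorted([(10, "node-A"), (45, "node-B"), (80, "node-C")])

def owner(key: str) -> str:
    t = token(key)
    tokens = [tok for tok, _ in ring]
    i = bisect.bisect_left(tokens, t) % len(ring)  # first node at/after the token, wrapping
    return ring[i][1]

# Adding a node claims only a slice of the ring; keys outside that slice
# keep their current owner, so capacity grows without resharding everything.
```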
VLDB benchmark (RWS): throughput (ops/sec) — Cassandra vs. HBase, Redis, MySQL
![Page 121: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/121.jpg)
Cassandra - Fully Replicated
• Client writes local
• Data syncs across WAN
• Replication per Data Center
![Page 122: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/122.jpg)
A Data Ocean, Pond, or Lake
An In-Memory Database
A Key-Value Store
A magical database unicorn that farts rainbows
![Page 123: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/123.jpg)
Cassandra for Applications

APACHE CASSANDRA
![Page 124: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/124.jpg)
Hands On!
https://github.com/killrweather/killrweather/wiki/6.-Cassandra-Exercises-on-Killrvideo-Data
KillrWeather Wiki
![Page 125: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/125.jpg)
4.1.2 Cassandra - Basic Architecture
![Page 126: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/126.jpg)
Row: Partition Key 1 → Column 1 | Column 2 | Column 3 | Column 4
![Page 127: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/127.jpg)
Partition: multiple rows sharing Partition Key 1, each with Column 1 through Column 4
![Page 128: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/128.jpg)
Partition with Clustering: Partition Key 1 with rows ordered by clustering key (Cluster 1 through Cluster 4), each with Column 1 through Column 3
![Page 129: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/129.jpg)
Table: partitions for Partition Key 1 and Partition Key 2, each holding rows of Column 1 through Column 4
![Page 130: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/130.jpg)
Keyspace: Keyspace 1 contains Table 1 and Table 2, each with partitions for Partition Key 1 and Partition Key 2
![Page 131: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/131.jpg)
Node = Server
![Page 132: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/132.jpg)
Token
• Each partition key is consistently hashed to a 64-bit token (between -2^63 and 2^63 - 1)
• Each node owns a range of those token values
• A node's range runs from its own token up to the next node's token
• Virtual Nodes break these ranges down further

Server: Data | Token Range starting at 0
![Page 133: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/133.jpg)
The cluster — one server

| Token | Range |
|---|---|
| 0 | 0-100 |
![Page 134: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/134.jpg)
The cluster — two servers

| Token | Range |
|---|---|
| 0 | 0-50 |
| 51 | 51-100 |
![Page 135: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/135.jpg)
The cluster — four servers

| Token | Range |
|---|---|
| 0 | 0-25 |
| 26 | 26-50 |
| 51 | 51-75 |
| 76 | 76-100 |
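The ring splitting above can be sketched in Python. The 0-100 token space is the slide's simplification, not Cassandra's real Murmur3 range, and `token_ranges` is a hypothetical helper:

```python
# Sketch of how a token ring splits as servers join.
# Token space 0-100 mirrors the slide's simplified example.

def token_ranges(tokens, space=101):
    """Each node owns from its token up to (not including) the next node's token."""
    tokens = sorted(tokens)
    ranges = {}
    for i, t in enumerate(tokens):
        nxt = tokens[(i + 1) % len(tokens)]
        size = (nxt - t) % space or space  # a lone node owns the whole ring
        ranges[t] = (t, (t + size - 1) % space)
    return ranges

print(token_ranges([0]))              # one server: 0-100
print(token_ranges([0, 51]))          # two servers: 0-50 and 51-100
print(token_ranges([0, 26, 51, 76]))  # four servers: 0-25, 26-50, 51-75, 76-100
```

Adding a server just splits an existing range; no data outside that range has to move.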
![Page 136: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/136.jpg)
4.1.3 Cassandra - Replication, High Availability and Multi-datacenter
![Page 137: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/137.jpg)
Replication — DC1: RF=1

| Node | Primary |
|---|---|
| 10.0.0.1 | 00-25 |
| 10.0.0.2 | 26-50 |
| 10.0.0.3 | 51-75 |
| 10.0.0.4 | 76-100 |
![Page 138: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/138.jpg)
Replication — DC1: RF=2

| Node | Primary | Replica |
|---|---|---|
| 10.0.0.1 | 00-25 | 76-100 |
| 10.0.0.2 | 26-50 | 00-25 |
| 10.0.0.3 | 51-75 | 26-50 |
| 10.0.0.4 | 76-100 | 51-75 |
![Page 139: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/139.jpg)
Replication — DC1: RF=3

| Node | Primary | Replica | Replica |
|---|---|---|---|
| 10.0.0.1 | 00-25 | 76-100 | 51-75 |
| 10.0.0.2 | 26-50 | 00-25 | 76-100 |
| 10.0.0.3 | 51-75 | 26-50 | 00-25 |
| 10.0.0.4 | 76-100 | 51-75 | 26-50 |
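The RF=3 layout in the table is just a clockwise walk around the ring: each range lives on its primary node plus the next RF-1 nodes. A minimal sketch, assuming SimpleStrategy-style placement (node names match the slide):

```python
# Sketch of RF=3 replica placement: each token range lands on its primary
# node plus the next RF-1 nodes walking clockwise around the ring.

nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  # ring order
ranges = ["00-25", "26-50", "51-75", "76-100"]            # primary range per node

def replicas_for(range_idx, rf=3):
    """Nodes holding the given range: primary first, then next rf-1 clockwise."""
    return [nodes[(range_idx + i) % len(nodes)] for i in range(rf)]

# Range 00-25 is primary on 10.0.0.1 and replicated to .2 and .3,
# matching the table above.
print(replicas_for(0))
```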
![Page 140: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/140.jpg)
Consistency — DC1: RF=3 (replica table as above)

Client: write to partition 15
![Page 141: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/141.jpg)
Repair — DC1: RF=3 (replica table as above)

Repair = "Am I consistent?" — "You are missing some data. Here, have some of mine."
![Page 142: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/142.jpg)
Consistency level

| Consistency Level | Number of Nodes Acknowledged |
|---|---|
| One | One - read repair triggered |
| Local One | One - read repair in local DC |
| Quorum | 51% |
| Local Quorum | 51% in local DC |
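"51%" here means a strict majority of the replicas for that partition; the arithmetic is simply:

```python
# Quorum is a strict majority of replicas: floor(RF / 2) + 1.

def quorum(rf):
    return rf // 2 + 1

print(quorum(3))  # 2 of 3 replicas must acknowledge
print(quorum(5))  # 3 of 5
```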
![Page 143: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/143.jpg)
Consistency — DC1: RF=3 (replica table as above)

Client: write to partition 15, CL = One
![Page 144: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/144.jpg)
Consistency — DC1: RF=3 (replica table as above)

Client: write to partition 15, CL = One
![Page 145: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/145.jpg)
Consistency — DC1: RF=3 (replica table as above)

Client: write to partition 15, CL = Quorum
![Page 146: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/146.jpg)
Multi-datacenter — DC1: RF=3, DC2: RF=3

DC1 nodes 10.0.0.1-10.0.0.4 and DC2 nodes 10.1.0.1-10.1.0.4, each DC with the same primary/replica layout as above

Client: write to partition 15
![Page 147: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/147.jpg)
Multi-datacenter — DC1: RF=3, DC2: RF=3

DC1 nodes 10.0.0.1-10.0.0.4 and DC2 nodes 10.1.0.1-10.1.0.4, each DC with the same primary/replica layout as above

Client: write to partition 15 — the write replicates within DC1
![Page 148: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/148.jpg)
Multi-datacenter — DC1: RF=3, DC2: RF=3

DC1 nodes 10.0.0.1-10.0.0.4 and DC2 nodes 10.1.0.1-10.1.0.4, each DC with the same primary/replica layout as above

Client: write to partition 15 — the write also replicates across to DC2
![Page 149: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/149.jpg)
4.2.1 Cassandra - Weather Website Example
![Page 150: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/150.jpg)
Example: Weather Station

• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence
• Aggregations in fast lookup table

Windsor, California — July 1, 2014: High 73.4, Low 51.4, Precipitation 0.0, 2014 Total 8.3"

Up-to-date Weather for Windsor, California as of 9 PM PST July 7th 2015: Current Temp 71 F, High 85 F, Low 58 F, Daily Precipitation 0.0", 2015 Total Precipitation 12.0"
![Page 151: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/151.jpg)
Weather Web Site

Cassandra-only DC
Cassandra + Spark DC
Spark Jobs
Spark Streaming
![Page 152: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/152.jpg)
Success starts with…
The data model!
![Page 153: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/153.jpg)
Relational Data Models
• 5 normal forms
• Foreign keys
• Joins

Employees

| deptId | First | Last |
|---|---|---|
| 1 | Edgar | Codd |
| 2 | Raymond | Boyce |

Department

| id | Dept |
|---|---|
| 1 | Engineering |
| 2 | Math |
![Page 154: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/154.jpg)
Relational Modeling: Data → Models → Application (start with the data)
![Page 155: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/155.jpg)
Cassandra Modeling: Application → Models → Data (start with the application's queries)
![Page 156: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/156.jpg)
CQL vs SQL
• No joins
• Limited aggregations

Employees

| deptId | First | Last |
|---|---|---|
| 1 | Edgar | Codd |
| 2 | Raymond | Boyce |

Department

| id | Dept |
|---|---|
| 1 | Engineering |
| 2 | Math |

SELECT e.First, e.Last, d.Dept
FROM Department d, Employees e
WHERE 'Codd' = e.Last
AND e.deptId = d.id
![Page 157: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/157.jpg)
Denormalization
• Combine table columns into a single view
• No joins

SELECT First, Last, Dept
FROM employees
WHERE id = '1'

Employees

| id | First | Last | Dept |
|---|---|---|---|
| 1 | Edgar | Codd | Engineering |
| 2 | Raymond | Boyce | Math |
![Page 158: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/158.jpg)
Queries supported
CREATE TABLE raw_weather_data (
   wsid text, year int, month int, day int, hour int,
   temperature double, dewpoint double, pressure double,
   wind_direction int, wind_speed double,
   sky_condition int, sky_condition_text text,
   one_hour_precip double, six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

Get weather data given:
• Weather Station ID
• Weather Station ID and time
• Weather Station ID and range of time
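What the clustering order buys can be shown with a minimal Python sketch (not Cassandra code): rows sharing the partition key are kept sorted by (year, month, day, hour), descending, so "latest first" queries fall out for free. Sample values follow the deck's later examples:

```python
# Sketch: rows within one partition are sorted by the clustering columns
# (year, month, day, hour), descending, as the table declares.
readings = [
    ("10010:99999", (2005, 12, 1, 7), -5.3),
    ("10010:99999", (2005, 12, 1, 10), -5.6),
    ("10010:99999", (2005, 12, 1, 9), -5.1),
    ("10010:99999", (2005, 12, 1, 8), -4.9),
]

partition = sorted((r for r in readings if r[0] == "10010:99999"),
                   key=lambda r: r[1], reverse=True)  # CLUSTERING ORDER BY ... DESC
print([r[1][3] for r in partition])  # hours, newest first: [10, 9, 8, 7]
```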
![Page 159: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/159.jpg)
Aggregation Queries
CREATE TABLE daily_aggregate_temperature (
   wsid text, year int, month int, day int,
   high double, low double, mean double, variance double, stdev double,
   PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

Get temperature stats given:
• Weather Station ID
• Weather Station ID and time
• Weather Station ID and range of time

Windsor, California — July 1, 2014: High 73.4, Low 51.4
![Page 160: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/160.jpg)
daily_aggregate_precip
CREATE TABLE daily_aggregate_precip (
   wsid text, year int, month int, day int,
   precipitation counter,
   PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

Get precipitation stats given:
• Weather Station ID
• Weather Station ID and time
• Weather Station ID and range of time

Windsor, California — July 1, 2014: High 73.4, Low 51.4, Precipitation 0.0
![Page 161: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/161.jpg)
year_cumulative_precip
CREATE TABLE year_cumulative_precip (
   wsid text, year int,
   precipitation counter,
   PRIMARY KEY ((wsid), year)
) WITH CLUSTERING ORDER BY (year DESC);

Get latest yearly precipitation accumulation:
• Weather Station ID
• Weather Station ID and time
• Provide fast lookup

Windsor, California — July 1, 2014: High 73.4, Low 51.4, Precipitation 0.0, 2014 Total 8.3"
![Page 162: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/162.jpg)
4.2.1.1.1 Cassandra - CQL
![Page 163: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/163.jpg)
Table
CREATE TABLE weather_station (
   id text, name text, country_code text, state_code text,
   call_sign text, lat double, long double, elevation double,
   PRIMARY KEY (id)
);

Callouts: table name, column name, column CQL type, primary key designation (partition key)
![Page 164: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/164.jpg)
Table
CREATE TABLE daily_aggregate_precip (
   wsid text, year int, month int, day int,
   precipitation counter,
   PRIMARY KEY ((wsid), year, month, day)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC);

Callouts: partition key, clustering columns, order override
![Page 165: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/165.jpg)
Insert
INSERT INTO weather_station (id, call_sign, country_code, elevation, lat, long, name, state_code)
VALUES ('727930:24233', 'KSEA', 'US', 121.9, 47.467, -122.32, 'SEATTLE SEATTLE-TACOMA INTL A', 'WA');

Callouts: table name, fields, values — the partition key is required
![Page 166: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/166.jpg)
Lightweight Transactions
INSERT INTO weather_station (id, call_sign, country_code, elevation, lat, long, name, state_code)
VALUES ('727930:24233', 'KSEA', 'US', 121.9, 47.467, -122.32, 'SEATTLE SEATTLE-TACOMA INTL A', 'WA')
IF NOT EXISTS;

Don't overwrite!
![Page 167: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/167.jpg)
Lightweight Transactions
CREATE TABLE IF NOT EXISTS weather_station (
   id text, name text, country_code text, state_code text,
   call_sign text, lat double, long double, elevation double,
   PRIMARY KEY (id)
);

No-op. Don't throw an error
![Page 168: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/168.jpg)
Select
SELECT id, call_sign, country_code, elevation, lat, long, name, state_code
FROM weather_station
WHERE id = '727930:24233';

| id | call_sign | country_code | elevation | lat | long | name | state_code |
|---|---|---|---|---|---|---|---|
| 727930:24233 | KSEA | US | 121.9 | 47.467 | -122.32 | SEATTLE SEATTLE-TACOMA INTL A | WA |

Callouts: fields, table name — the partition key is required in the WHERE clause
![Page 169: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/169.jpg)
Update
UPDATE weather_station
SET name = 'SeaTac International Airport'
WHERE id = '727930:24233';

| id | call_sign | country_code | elevation | lat | long | name | state_code |
|---|---|---|---|---|---|---|---|
| 727930:24233 | KSEA | US | 121.9 | 47.467 | -122.32 | SeaTac International Airport | WA |

Callouts: table name; fields to update must not be in the primary key; primary key required in WHERE
![Page 170: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/170.jpg)
Lightweight Transactions
UPDATE weather_station
SET name = 'SeaTac International Airport'
WHERE id = '727930:24233'
IF name = 'SEATTLE SEATTLE-TACOMA INTL A';
Don’t overwrite!
![Page 171: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/171.jpg)
Delete
DELETE FROM weather_station
WHERE id = '727930:24233';

Callouts: table name — the primary key is required
![Page 172: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/172.jpg)
Collections: Set

CREATE TABLE weather_station (
   id text, name text, country_code text, state_code text,
   call_sign text, lat double, long double, elevation double,
   equipment set<text>,
   PRIMARY KEY (id)
);

Callouts: equipment set<text> — column name; CQL type (used for ordering the set's elements)
![Page 173: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/173.jpg)
Collections: Set, List

CREATE TABLE weather_station (
   id text, name text, country_code text, state_code text,
   call_sign text, lat double, long double, elevation double,
   equipment set<text>,
   service_dates list<timestamp>,
   PRIMARY KEY (id)
);

Callouts: equipment set<text> — column name, CQL type (for ordering); service_dates list<timestamp> — column name, CQL type
![Page 174: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/174.jpg)
Collections: Set, List, Map

CREATE TABLE weather_station (
   id text, name text, country_code text, state_code text,
   call_sign text, lat double, long double, elevation double,
   equipment set<text>,
   service_dates list<timestamp>,
   service_notes map<timestamp,text>,
   PRIMARY KEY (id)
);

Callouts: equipment set<text>; service_dates list<timestamp>; service_notes map<timestamp,text> — column name, CQL key type, CQL value type
![Page 175: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/175.jpg)
User Defined Functions*

*As of Cassandra 2.2

• Built-in: avg, min, max, count(<column name>)
• Runs on server
• Always use with partition key
![Page 176: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/176.jpg)
User Defined Functions
CREATE FUNCTION maxI(current int, candidate int)
CALLED ON NULL INPUT
RETURNS int
LANGUAGE java AS
'if (current == null) return candidate; else return Math.max(current, candidate);';

CREATE AGGREGATE maxAgg(int)
SFUNC maxI
STYPE int
INITCOND null;

SELECT maxAgg(temperature)
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;

Callouts: CQL type; pure function; aggregate using the function over a partition
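The maxAgg aggregate above behaves like a fold over the partition's rows; a minimal Python sketch of the same logic:

```python
# Sketch of the maxAgg aggregate as a fold: the state function maxI
# is applied row by row, starting from the INITCOND of null (None).

def max_i(current, candidate):
    return candidate if current is None else max(current, candidate)

temperatures = [-5.6, -5.1, -4.9, -5.3]  # one partition's rows
state = None                              # INITCOND null
for t in temperatures:
    state = max_i(state, t)
print(state)  # -4.9
```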
![Page 177: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/177.jpg)
4.2.1.1.2 Cassandra - Partitions and clustering
![Page 178: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/178.jpg)
Primary Key
CREATE TABLE raw_weather_data (
   wsid text, year int, month int, day int, hour int,
   temperature double, dewpoint double, pressure double,
   wind_direction int, wind_speed double,
   sky_condition int, sky_condition_text text,
   one_hour_precip double, six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
![Page 179: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/179.jpg)
Primary key relationship
PRIMARY KEY ((wsid),year,month,day,hour)
![Page 180: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/180.jpg)
Primary key relationship
Partition Key
PRIMARY KEY ((wsid),year,month,day,hour)
![Page 181: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/181.jpg)
Primary key relationship
PRIMARY KEY ((wsid),year,month,day,hour)
Partition Key Clustering Columns
![Page 182: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/182.jpg)
Primary key relationship
Partition Key Clustering Columns
10010:99999
PRIMARY KEY ((wsid),year,month,day,hour)
![Page 183: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/183.jpg)
Primary key relationship

PRIMARY KEY ((wsid), year, month, day, hour)

Partition Key: 10010:99999 — Clustering Columns: 2005:12:1:10 → -5.6, 2005:12:1:9 → -5.1, 2005:12:1:8 → -4.9, 2005:12:1:7 → -5.3
![Page 184: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/184.jpg)
Clustering — raw_weather_data, ORDER BY DESC

| wsid | year | month | day | hour | temperature |
|---|---|---|---|---|---|
| 10010:99999 | 2005 | 12 | 1 | 10 | -5.6 |
| 10010:99999 | 2005 | 12 | 1 | 9 | -5.1 |
| 10010:99999 | 2005 | 12 | 1 | 8 | -4.9 |
| 10010:99999 | 2005 | 12 | 1 | 7 | -5.3 |
![Page 185: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/185.jpg)
Partition keys

10010:99999 → Murmur3 hash → Token = 7224631062609997448
722266:13850 → Murmur3 hash → Token = -6804302034103043898

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('722266:13850', 2005, 12, 1, 7, -5.6);

Consistent hash: a 64-bit token between -2^63 and 2^63 - 1
![Page 186: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/186.jpg)
Partition keys

10010:99999 → Murmur3 hash → Token = 15
722266:13850 → Murmur3 hash → Token = 77

For this example, let's make it a reasonable number

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('722266:13850', 2005, 12, 1, 7, -5.6);
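Python's standard library has no Murmur3, so this sketch uses MD5 as a stand-in hash (the hypothetical `toy_token` and `owner` helpers are illustrations, not Cassandra's actual partitioner) to show the idea: hash the partition key into the simplified 0-100 token space, then find the node whose range contains it:

```python
import hashlib

# Stand-in for Cassandra's Murmur3 partitioner: any deterministic hash
# mapped into the slide's simplified 0-100 token space illustrates the idea.
def toy_token(partition_key, space=101):
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % space

def owner(token, node_tokens):
    """The owning node is the one whose token starts the range containing ours."""
    candidates = [t for t in sorted(node_tokens) if t <= token]
    return candidates[-1] if candidates else max(node_tokens)

node_tokens = [0, 26, 51, 76]  # the four-server ring from earlier
t = toy_token("10010:99999")
print(t, "-> node with token", owner(t, node_tokens))
```

The same key always hashes to the same token, which is what makes every node able to route any request.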
![Page 187: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/187.jpg)
Data Locality — DC1: RF=3 and DC2: RF=3 (replica tables as above)

A client in each data center reads partition 15 from its local replicas
![Page 188: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/188.jpg)
Data Locality

wsid = '10010:99999'? Even in a 1000-node cluster: "You are here!"
![Page 189: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/189.jpg)
4.2.1.1.3 Cassandra - Read and Write Path
![Page 190: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/190.jpg)
Writes

CREATE TABLE raw_weather_data (
   wsid text, year int, month int, day int, hour int,
   temperature double, dewpoint double, pressure double,
   wind_direction int, wind_speed double,
   sky_condition int, sky_condition_text text,
   one_hour_precip double, six_hour_precip double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
![Page 191: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/191.jpg)
Writes

CREATE TABLE raw_weather_data (
   wsid text, year int, month int, day int, hour int,
   temperature double,
   PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 10, -5.6);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 9, -5.1);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 8, -4.9);

INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.3);
![Page 192: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/192.jpg)
Write Path

Client:
INSERT INTO raw_weather_data (wsid, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.3);

Node: the write is appended to the Commit Log and applied to the Memtable (rows keyed by wsid, year, month, day, hour); memtables flush to SSTables, and compaction merges SSTables
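The write path can be sketched as a toy in-memory model. This is a minimal sketch with hypothetical names (`write`, `flush`, `flush_at`); the real memtable, commit log, and flush thresholds are far more involved:

```python
# Toy model of the write path: append to the commit log for durability,
# apply to the in-memory memtable, flush a sorted SSTable when full.

commit_log = []
memtable = {}   # partition key -> {clustering key: value}
sstables = []   # each flush produces one immutable, sorted SSTable

def flush():
    # SSTable rows come out sorted, newest clustering key first (DESC order)
    table = {k: sorted(v.items(), reverse=True) for k, v in memtable.items()}
    sstables.append(table)
    memtable.clear()

def write(wsid, clustering, temperature, flush_at=3):
    commit_log.append((wsid, clustering, temperature))   # durability first
    memtable.setdefault(wsid, {})[clustering] = temperature
    if sum(len(v) for v in memtable.values()) >= flush_at:
        flush()

write("10010:99999", (2005, 12, 1, 7), -5.3)
write("10010:99999", (2005, 12, 1, 8), -4.9)
write("10010:99999", (2005, 12, 1, 9), -5.1)  # third write triggers a flush
print(len(sstables), sstables[0]["10010:99999"][0])
```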
![Page 193: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/193.jpg)
Storage Model - Logical View

| wsid | hour | temperature |
|---|---|---|
| 10010:99999 | 2005:12:1:10 | -5.6 |
| 10010:99999 | 2005:12:1:9 | -5.1 |
| 10010:99999 | 2005:12:1:8 | -4.9 |
| 10010:99999 | 2005:12:1:7 | -5.3 |

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;
![Page 194: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/194.jpg)
Storage Model - Disk Layout

10010:99999: 2005:12:1:10 → -5.6 | 2005:12:1:9 → -5.1 | 2005:12:1:8 → -4.9 | 2005:12:1:7 → -5.3

Merged, sorted and stored sequentially

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;
![Page 195: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/195.jpg)
Storage Model - Disk Layout

10010:99999: 2005:12:1:11 → -4.9 | 2005:12:1:10 → -5.6 | 2005:12:1:9 → -5.1 | 2005:12:1:8 → -4.9 | 2005:12:1:7 → -5.3

Merged, sorted and stored sequentially

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1;
![Page 196: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/196.jpg)
Storage Model - Disk Layout

10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10 | 2005:12:1:11 | 2005:12:1:12
            | -5.3        | -4.9        | -5.1        | -5.6         | -4.9         | -5.4

Merged, Sorted and Stored Sequentially

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
  AND year = 2005 AND month = 12 AND day = 1;
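The "merged, sorted, stored sequentially" idea above can be sketched in plain Python. This is a hypothetical simplification of compaction, not Cassandra's actual code: each SSTable is already sorted by clustering key, and compaction folds several of them into one sorted run, with the newest write winning on duplicate keys.

```python
def compact(sstables):
    """Merge several SSTable runs (each already sorted by clustering
    key) into one sorted run. Tables are ordered oldest -> newest,
    so for a duplicated key the newest write wins."""
    merged = {}
    for table in sstables:
        for key, value in table:
            merged[key] = value  # later (newer) write overwrites
    # One sequential, sorted run -- what ends up on disk
    return sorted(merged.items())

# Two flushed memtables for partition '10010:99999';
# clustering key is the (year, month, day, hour) tuple
sst_old = [((2005, 12, 1, 7), -5.3), ((2005, 12, 1, 9), -5.1)]
sst_new = [((2005, 12, 1, 8), -4.9), ((2005, 12, 1, 10), -5.6)]
compacted = compact([sst_old, sst_new])
# rows for hours 7..10, merged and sorted sequentially
```

Because the output is a single sorted run, the range queries on the following slides can be served with one seek and a sequential read.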
![Page 197: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/197.jpg)
Read Path

Client:

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
  AND year = 2005 AND month = 12 AND day = 1
  AND hour >= 7 AND hour <= 10;

Node:
Memtable:
wsid 1 | year 1 | month 1 | day 1 | hour 1 | Temp
wsid 2 | year 2 | month 2 | day 2 | hour 2 | Temp

SSTable  SSTable  SSTable
Data
![Page 198: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/198.jpg)
Query patterns
• Range queries
• "Slice" operation on disk

Single seek on disk
Partition key for locality

10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10
            | -5.3        | -4.9        | -5.1        | -5.6

SELECT wsid, hour, temperature
FROM raw_weather_data
WHERE wsid = '10010:99999'
  AND year = 2005 AND month = 12 AND day = 1
  AND hour >= 7 AND hour <= 10;
![Page 199: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/199.jpg)
Query patterns
• Range queries
• "Slice" operation on disk

Programmers like this
Sorted by event_time

weather_station   hour           temperature
10010:99999       2005:12:1:7    -5.3
10010:99999       2005:12:1:8    -4.9
10010:99999       2005:12:1:9    -5.1
10010:99999       2005:12:1:10   -5.6

SELECT weatherstation, hour, temperature
FROM temperature
WHERE weatherstation_id = '10010:99999'
  AND year = 2005 AND month = 12 AND day = 1
  AND hour >= 7 AND hour <= 10;
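Why the slice is a single seek: within a partition, rows are kept sorted by clustering key, so a range predicate like `hour >= 7 AND hour <= 10` is one binary search to the start plus a contiguous read. A rough Python sketch of that access pattern (illustrative only, not the storage engine):

```python
from bisect import bisect_left, bisect_right

def slice_partition(rows, lo, hi):
    """rows: (clustering_key, value) pairs, sorted by key within the
    partition. A range predicate becomes one 'seek' to the first
    matching key, then a sequential scan to the last one."""
    keys = [k for k, _ in rows]
    start = bisect_left(keys, lo)   # the single seek
    end = bisect_right(keys, hi)    # end of the contiguous slice
    return rows[start:end]

# Partition 10010:99999, clustering keys (year, month, day, hour)
partition = [((2005, 12, 1, h), t) for h, t in
             [(7, -5.3), (8, -4.9), (9, -5.1), (10, -5.6), (11, -4.9)]]

# hour >= 7 AND hour <= 10
window = slice_partition(partition, (2005, 12, 1, 7), (2005, 12, 1, 10))
```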
![Page 200: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/200.jpg)
5.1 Spark and Cassandra - Architecture
![Page 201: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/201.jpg)
Great combo
Store a ton of data Analyze a ton of data
![Page 202: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/202.jpg)
Great combo
Spark Streaming
Near Real-time
SparkSQL
Structured Data
MLLib
Machine Learning
GraphX
Graph Analysis
![Page 203: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/203.jpg)
Great combo

Spark Streaming
Near Real-time
SparkSQL
Structured Data
MLLib
Machine Learning
GraphX
Graph Analysis
CREATE TABLE raw_weather_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  sky_condition int,
  sky_condition_text text,
  one_hour_precip double,
  six_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
Spark Connector
![Page 204: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/204.jpg)
Executor
Master
Worker
Executor
Executor
Server
![Page 205: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/205.jpg)
Master
Worker
Worker
Worker Worker
Token Ranges 0-100
0-24
25-49
50-74
75-99
I will only analyze 25% of the data.
![Page 206: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/206.jpg)
Master
Worker
Worker
Worker Worker
0-24
25-49
50-74
75-99
75-99
0-24
25-49
50-74
Analytics
Transactional
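The locality trick above: each Spark worker is co-located with a Cassandra node and only reads the token range that node owns, so each worker analyzes roughly its share of the data. A minimal sketch of that assignment, assuming a toy 0-99 token space split evenly (real Cassandra uses a much larger token space and vnodes):

```python
def split_token_ranges(total_tokens, n_workers):
    """Evenly split the token space [0, total_tokens) into contiguous
    ranges, one per co-located Spark worker / Cassandra node."""
    size = total_tokens // n_workers
    return [(i * size,
             total_tokens - 1 if i == n_workers - 1 else (i + 1) * size - 1)
            for i in range(n_workers)]

ranges = split_token_ranges(100, 4)  # four workers, as on the slide

def owning_worker(token, ranges):
    """Route a partition's token to the worker whose range holds it."""
    for worker, (start, end) in enumerate(ranges):
        if start <= token <= end:
            return worker
```

With four workers, `ranges` comes out as the slide's 0-24, 25-49, 50-74, 75-99 split, and each worker scans only its quarter of the data.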
![Page 207: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/207.jpg)
Executor
Master
Worker
Executor
Executor
75-99
SELECT * FROM keyspace.table WHERE token(pk) > 75 AND token(pk) <= 99
Spark RDD
Spark Partition
Spark Partition
Spark Partition
Spark Connector
![Page 208: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/208.jpg)
Executor
Master
Worker
Executor
Executor
75-99
Spark RDD
Spark Partition
Spark Partition
Spark Partition
![Page 209: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/209.jpg)
Spark Connector

                           Cassandra   Cassandra + Spark
Joins and Unions           No          Yes
Transformations            Limited     Yes
Outside Data Integration   No          Yes
Aggregations               Limited     Yes
![Page 210: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/210.jpg)
Type mapping

CQL Type          Scala Type
ascii             String
bigint            Long
boolean           Boolean
counter           Long
decimal           BigDecimal, java.math.BigDecimal
double            Double
float             Float
inet              java.net.InetAddress
int               Int
list              Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map               Map, TreeMap, java.util.HashMap
set               Set, TreeSet, java.util.HashSet
text, varchar     String
timestamp         Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid          java.util.UUID
uuid              java.util.UUID
varint            BigInt, java.math.BigInteger
*nullable values  Option
![Page 211: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/211.jpg)
Execution of jobs

Local:
• Connect to localhost master
• Single-system dev
• Runs stand-alone

Cluster:
• Connect to Spark master IP
• Production configuration
• Submit using spark-submit
![Page 212: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/212.jpg)
Summary
• Cassandra acts as the storage layer for Spark
• Deploy in a mixed cluster configuration
• Spark executors access Cassandra using the DataStax connector
• Deploy your jobs in either local or cluster mode
![Page 213: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/213.jpg)
5.2 Spark and Cassandra - Analyzing Cassandra Data
![Page 214: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/214.jpg)
Attaching to Spark and Cassandra
// Import Cassandra-specific functions on SparkContext and RDD objects
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._

/** The setMaster("local") lets us run & test the job right in our IDE */
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("local[*]")
  .setAppName(getClass.getName)
  // Optionally
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)
![Page 215: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/215.jpg)
Weather station example
CREATE TABLE raw_weather_data (
  wsid text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  dewpoint double,
  pressure double,
  wind_direction int,
  wind_speed double,
  sky_condition int,
  sky_condition_text text,
  one_hour_precip double,
  six_hour_precip double,
  PRIMARY KEY ((wsid), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
![Page 216: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/216.jpg)
Simple example
/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")

sc.stop()
![Page 217: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/217.jpg)
Simple example

/** keyspace & table */
val tableRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")

/** get a simple count of all the rows in the raw_weather_data table */
val rowCount = tableRDD.count()
println(s"Total Rows in Raw Weather Table: $rowCount")

sc.stop()
Executor
SELECT * FROM isd_weather_data.raw_weather_data
Spark RDD
Spark Partition
Spark Connector
![Page 218: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/218.jpg)
Using CQL
SELECT temperature
FROM raw_weather_data
WHERE wsid = '724940:23234'
  AND year = 2008
  AND month = 12
  AND day = 1;

val cqlRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("temperature")
  .where("wsid = ? AND year = ? AND month = ? AND day = ?",
    "724940:23234", "2008", "12", "1")
![Page 219: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/219.jpg)
Using SQL!
spark-sql> SELECT wsid, year, month, day,
                  max(temperature) high,
                  min(temperature) low
           FROM raw_weather_data
           WHERE month = 6 AND temperature != 0.0
           GROUP BY wsid, year, month, day;

724940:23234  2008  6  1  15.6  10.0
724940:23234  2008  6  2  15.6  10.0
724940:23234  2008  6  3  17.2  11.7
724940:23234  2008  6  4  17.2  10.0
724940:23234  2008  6  5  17.8  10.0
724940:23234  2008  6  6  17.2  10.0
724940:23234  2008  6  7  20.6  8.9
![Page 220: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/220.jpg)
SQL with a Join
spark-sql> SELECT ws.name, raw.hour, raw.temperature
           FROM raw_weather_data raw
           JOIN weather_station ws ON raw.wsid = ws.id
           WHERE raw.wsid = '724940:23234'
             AND raw.year = 2008 AND raw.month = 6 AND raw.day = 1;

SAN FRANCISCO INTL AP  23  15.0
SAN FRANCISCO INTL AP  22  15.0
SAN FRANCISCO INTL AP  21  15.6
SAN FRANCISCO INTL AP  20  15.0
SAN FRANCISCO INTL AP  19  15.0
SAN FRANCISCO INTL AP  18  14.4
![Page 221: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/221.jpg)
Analyzing large data sets
val spanRDD = sc.cassandraTable("isd_weather_data", "raw_weather_data")
  .select("temperature")
  .where("wsid = ? AND year = ? AND month = ? AND day = ?",
    "724940:23234", "2008", "12", "1")
  .spanBy(row => row.getString("wsid"))
• Specify partition grouping
• Use with large partitions
• Perfect for time series
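What spanBy buys you: rows arrive already ordered by partition key, so grouping by that key needs no shuffle, just a walk over adjacent rows. The idea can be modeled in plain Python with itertools.groupby (a sketch of the concept, not the connector's implementation; station ids are toy values):

```python
from itertools import groupby

# Rows stream in ordered by partition key, so all rows for one
# weather station are physically adjacent -- exactly what spanBy exploits.
rows = [
    ("724940:23234", -5.3),
    ("724940:23234", -4.9),
    ("725030:14732", -1.0),
]

# spanBy-style grouping: no shuffle, just span adjacent equal keys
spans = {wsid: [t for _, t in group]
         for wsid, group in groupby(rows, key=lambda r: r[0])}
```

Like `groupby`, spanBy only works correctly when equal keys are contiguous, which Cassandra's partition ordering guarantees, and which is why it suits large time-series partitions.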
![Page 222: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/222.jpg)
Saving back the weather data
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("isd_weather_data")

cc.sql("""
    SELECT wsid, year, month, day,
           max(temperature) high,
           min(temperature) low
    FROM raw_weather_data
    WHERE month = 6 AND temperature != 0.0
    GROUP BY wsid, year, month, day""")
  .map { row => (row.getString(0), row.getInt(1), row.getInt(2),
                 row.getInt(3), row.getDouble(4), row.getDouble(5)) }
  .saveToCassandra("isd_weather_data", "daily_aggregate_temperature")
![Page 223: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/223.jpg)
Guest speaker!
Jon Haddad - Chief Data Scientist
![Page 224: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/224.jpg)
In the beginning… there was RDD

sc = SparkContext(appName="PythonPi")
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = sc.parallelize(range(1, n + 1), partitions).\
    map(f).reduce(add)

print("Pi is roughly %f" % (4.0 * count / n))

sc.stop()
![Page 225: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/225.jpg)
Why Not Python + RDDs?
RDD -> Py4J -> JavaGatewayServer -> RDD
![Page 226: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/226.jpg)
DataFrames
• Abstraction over RDDs
• Modeled after Pandas & R
• Structured data
• Python passes commands only
• Commands are pushed down
• Data never leaves the JVM
• You can still use the RDD if you want
  • DataFrame.rdd

RDD
DataFrame
![Page 227: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/227.jpg)
Let's play with code
![Page 228: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/228.jpg)
Sample Dataset - Movielens
• Subset of movies (1-100)
• ~800k ratings

CREATE TABLE movielens.movie (
  movie_id int PRIMARY KEY,
  genres set<text>,
  title text
)

CREATE TABLE movielens.rating (
  movie_id int,
  user_id int,
  rating decimal,
  ts int,
  PRIMARY KEY (movie_id, user_id)
)
![Page 229: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/229.jpg)
Reading Cassandra Tables
• DataFrames has a standard interface for reading
• Cache if you want to keep the dataset in memory

cl = "org.apache.spark.sql.cassandra"

movies = sql.read.format(cl).\
    load(keyspace="movielens", table="movie").cache()

ratings = sql.read.format(cl).\
    load(keyspace="movielens", table="rating").cache()
![Page 230: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/230.jpg)
Filtering
• Select specific rows matching various patterns
• Fields do not require indexes
• Filtering occurs in memory
• You can use DSE Solr Search queries
• Filtering returns a DataFrame

movies.filter(movies.movie_id == 1)
movies[movies.movie_id == 1]
movies.filter("movie_id=1")

movies.filter("title like '%Kombat%'")

movie_id  title                 genres
44        Mortal Kombat (1995)  ['Action', 'Adventure', 'Fantasy']
![Page 231: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/231.jpg)
Filtering
• Helper function: explode()
• select() to keep specific columns
• alias() to rename

title
GoldenEye (1995)
Broken Arrow (1996)
Mortal Kombat (1995)
White Squall (1996)
Nick of Time (1995)

from pyspark.sql import functions as F
movies.select("title", F.explode("genres").\
    alias("genre")).\
    filter("genre = 'Action'").select("title")

title                genre
Broken Arrow (1996)  Action
Broken Arrow (1996)  Adventure
Broken Arrow (1996)  Thriller
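To see what explode() does without a Spark cluster, here is the same shape in plain Python: one input row per movie becomes one output row per genre, then the filter keeps only Action titles. Toy data, illustrative only:

```python
# (title, genres) rows, mirroring the movielens.movie table shape
movies = [
    ("Mortal Kombat (1995)", ["Action", "Adventure", "Fantasy"]),
    ("Persuasion (1995)", ["Drama", "Romance"]),
]

# explode("genres"): one (title, genre) row per element of the collection
exploded = [(title, genre) for title, genres in movies for genre in genres]

# .filter("genre = 'Action'").select("title")
action_titles = [title for title, genre in exploded if genre == "Action"]
```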
![Page 232: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/232.jpg)
Aggregation
• Count, sum, avg
• In SQL: GROUP BY
• Useful with Spark Streaming
  • Aggregate raw data
  • Send to dashboards

ratings.groupBy("movie_id").\
    agg(F.avg("rating").alias('avg'))

ratings.groupBy("movie_id").avg("rating")

movie_id  avg
31        3.24
32        3.8823
33        3.021
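Spelled out by hand, `groupBy("movie_id").avg("rating")` is a running sum and count per key followed by a division. A pure-Python sketch with toy ratings (the ids match the slide; the individual ratings are invented to produce those averages):

```python
from collections import defaultdict

ratings = [(31, 3.0), (31, 3.48), (32, 4.0), (32, 3.7646)]

# groupBy("movie_id").avg("rating"), by hand:
acc = defaultdict(lambda: [0.0, 0])   # movie_id -> [sum, count]
for movie_id, rating in ratings:
    acc[movie_id][0] += rating
    acc[movie_id][1] += 1

averages = {m: s / n for m, (s, n) in acc.items()}
```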
![Page 233: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/233.jpg)
Joins
• Inner join by default
• Can do various outer joins as well
• Returns a new DF with all the columns

ratings.join(movies, "movie_id")

DataFrame[movie_id: int, user_id: int, rating: decimal(10,0),
          ts: int, genres: array<string>, title: string]
![Page 234: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/234.jpg)
Chaining Operations
• Similar to SQL, we can build up in complexity
• Combine joins with aggregations, limits & sorting

ratings.groupBy("movie_id").\
    agg(F.avg("rating").\
    alias('avg')).\
    sort("avg", ascending=False).\
    limit(3).\
    join(movies, "movie_id").\
    select("title", "avg")

title                        avg
Usual Suspects, The (1995)   4.32
Seven (a.k.a. Se7en) (1995)  4.054
Persuasion (1995)            4.053
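The whole chain (aggregate, sort, limit, join, select) can be traced in plain Python to check the logic. The movie ids and individual ratings below are invented toy data; the titles come from the slide:

```python
from collections import defaultdict

ratings = [(1, 4.5), (1, 4.0), (2, 5.0), (3, 3.0), (3, 3.5)]
movies = {1: "Usual Suspects, The (1995)",
          2: "Seven (a.k.a. Se7en) (1995)",
          3: "Persuasion (1995)"}

# groupBy("movie_id").agg(avg("rating"))
acc = defaultdict(lambda: [0.0, 0])
for movie_id, r in ratings:
    acc[movie_id][0] += r
    acc[movie_id][1] += 1
averages = {m: s / n for m, (s, n) in acc.items()}

# sort(ascending=False).limit(2).join(movies).select("title", "avg")
top = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:2]
result = [(movies[m], avg) for m, avg in top]
```

Each step returns a new collection, just as each DataFrame operation returns a new DataFrame, which is what makes this style of chaining composable.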
![Page 235: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/235.jpg)
SparkSQL
• Register DataFrame as a table
• Query using HiveQL syntax

movies.registerTempTable("movie")
ratings.registerTempTable("rating")

sql.sql("""select title, avg(rating) as avg_rating
           from movie join rating
             on movie.movie_id = rating.movie_id
           group by title
           order by avg_rating DESC
           limit 3""")
![Page 236: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/236.jpg)
Database Migrations
• DataFrame reader supports JDBC
• JOIN operations can be cross-DB
• Read DataFrame from JDBC, write to Cassandra
![Page 237: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/237.jpg)
Inter-DB Migration
from pyspark.sql import SQLContext sql = SQLContext(sc)
m_con = "jdbc:mysql://127.0.0.1:3307/movielens?user=root"
movies = sql.read.jdbc(m_con, "movielens.movies")
movies.write.format("org.apache.spark.sql.cassandra").\
    options(table="movie", keyspace="lens").\
    save(mode="append")
http://rustyrazorblade.com/2015/08/migrating-from-mysql-to-cassandra-using-spark/
![Page 238: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/238.jpg)
Visualization
• dataframe.toPandas()
• Matplotlib
• Seaborn (looks nicer)
• Crunch big data in Spark
![Page 239: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/239.jpg)
Jupyter Notebooks
• Iterate quickly
• Test ideas
• Graph results
![Page 240: An Introduction to time series with Team Apache](https://reader035.vdocuments.us/reader035/viewer/2022062316/58ef87241a28ab3d1f8b45c1/html5/thumbnails/240.jpg)
Hands On!
https://github.com/killrweather/killrweather/wiki/7.-Spark-and-Cassandra-Exercises-for-KillrWeather-data
KillrWeather Wiki