![Page 1: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/1.jpg)
Building Scalable Data Pipelines
Evan Chan
![Page 2: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/2.jpg)
Who am IDistinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark since 0.9
Co-creator and maintainer of Spark Job Server
![Page 3: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/3.jpg)
TupleJump - Big Data Dev Partners 3
![Page 4: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/4.jpg)
Instant Gratification
I want insights now
I want to act on news right away
I want stuff personalized for me (?)
![Page 5: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/5.jpg)
Fast Data, notBig Data
![Page 6: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/6.jpg)
How Fast do you Need to Act?
Financial trading - milliseconds
Dashboards - seconds to minutes
BI / Reports - hours to days?
![Page 7: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/7.jpg)
What’s Your App?
Concurrent video viewers
Anomaly detection
Clickstream analysis
Live geospatial maps
Real-time trend detection & learning
![Page 8: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/8.jpg)
Common Components
Message Queue
EventsStream
Processing Layer
State / Database
Happy Users
![Page 9: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/9.jpg)
Example: Real-time trend detection
Events: time, OS, location, asset/product ID
Analyze 1-5 second batches of new “hot” data in stream processor
Combine with recent and historical top K feature vectors in database
Update database recent feature vectors
Serve to users
![Page 10: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/10.jpg)
Example 2: Smart Cities
![Page 11: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/11.jpg)
Smart City Streaming Data
City buses - regular telemetry (position, velocity, timestamp)
Street sweepers - regular telemetry
Transactions from rail, subway, buses, smart cards
311 info
911 info - new emergencies
![Page 12: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/12.jpg)
Citizens want to know…
Where and for how long can I park my car?
Are transportation options affected by 311 and 911 events?
How long will it take the next bus to get here?
Where is the closest bus to where I am?
![Page 13: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/13.jpg)
Cities want to know…
How can I maximize parking revenue?
More granular updates to parking spots that don't need sweeping
How does traffic affect waiting times in public transit, and revenue?
Patterns in subway train times - is a breakdown coming?
Population movement - where should new transit routes be placed?
![Page 14: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/14.jpg)
Message Queue
Stream Processing
Layer
Event storage
Ad-Hoc
311
911
Buses
MetroShort term telemetry
Models
Dashboard
![Page 15: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/15.jpg)
The HARD Principle
Highly Available, Resilient, Distributed
Flexibility - do as many transformations as possible with as few components as possible
Real-time: “NoETL”
Community: best of breed OSS projects with huge adoption and commercial support
![Page 16: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/16.jpg)
Message Queue
![Page 17: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/17.jpg)
Message Queue
EventsStream
Processing Layer
State / Database
Happy Users
![Page 18: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/18.jpg)
Why a message queue?
Centralized publish-subscribe of events
Need more processing? Add another consumer
Buffer traffic spikes
Replay events in cases of failure
![Page 19: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/19.jpg)
Message Queues help distribute data
A-F
G-M
N-S
T-Z
Input 1
Input 2
Input3
Input4
Processing
Processing
Processing
Processing
![Page 20: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/20.jpg)
Intro to Apache Kafka
Kafka is a distributed publish subscribe system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project in 2012
![Page 21: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/21.jpg)
On being HARDMany Big Data projects are open source implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product
The same codebase being used for years at LinkedIn answers the questions:
Does it scale?
Is it robust?
![Page 22: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/22.jpg)
Ad Hoc ETL
![Page 23: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/23.jpg)
Decoupled ETL
![Page 24: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/24.jpg)
Avro Schemas And Schema Registry
Keys and values in Kafka can be Strings or byte arrays
Avro is a serialization format used extensively with Kafka and Big Data
Kafka uses a Schema Registry to keep track of Avro schemas Verifies that the correct schemas are being used
![Page 25: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/25.jpg)
Consumer Groups
![Page 26: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/26.jpg)
Commit Logs
![Page 27: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/27.jpg)
Kafka Resources
Official docs - https://kafka.apache.org/documentation.html
Design section is really good read
http://www.confluent.io/product
Includes schema registry
![Page 28: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/28.jpg)
Stream Processing
![Page 29: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/29.jpg)
Message Queue
EventsStream
Processing Layer
State / Database
Happy Users
![Page 30: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/30.jpg)
Types of Stream Processors
Event by Event: Apache Storm, Apache Flink, Intel GearPump, Akka
Micro-batch: Apache Spark
Hybrid? Google Dataflow
![Page 31: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/31.jpg)
Apache Storm and Flink
Transform one message at a time
Very low latency
State and more complex analytics difficult
![Page 32: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/32.jpg)
Akka and Gearpump
Actor to actor messaging. Local state.
Used for extreme low latency (ad networks, etc)
Dynamically reconfigurable topology
Configurable fault tolerance and failure recovery
Cluster or local mode - you don’t always need distribution!
![Page 33: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/33.jpg)
Spark Streaming
Data processed as stream of micro batches
Higher latency (seconds), higher throughput, more complex analysis / ML possible
Same programming model as batch
![Page 34: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/34.jpg)
Why Spark?
file = spark.textFile("hdfs://...") file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _)
1 package org.myorg; 2 3 import java.io.IOException; 4 import java.util.*; 5 6 import org.apache.hadoop.fs.Path; 7 import org.apache.hadoop.conf.*; 8 import org.apache.hadoop.io.*; 9 import org.apache.hadoop.mapreduce.*; 10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 14 15 public class WordCount { 16 17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 18 private final static IntWritable one = new IntWritable(1); 19 private Text word = new Text(); 20 21 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 22 String line = value.toString(); 23 StringTokenizer tokenizer = new StringTokenizer(line); 24 while (tokenizer.hasMoreTokens()) { 25 word.set(tokenizer.nextToken()); 26 context.write(word, one); 27 } 28 } 29 } 30 31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 32 33 public void reduce(Text key, Iterable<IntWritable> values, Context context) 34 throws IOException, InterruptedException { 35 int sum = 0; 36 for (IntWritable val : values) { 37 sum += val.get(); 38 } 39 context.write(key, new IntWritable(sum)); 40 } 41 } 42 43 public static void main(String[] args) throws Exception { 44 Configuration conf = new Configuration(); 45 46 Job job = new Job(conf, "wordcount"); 47 48 job.setOutputKeyClass(Text.class); 49 job.setOutputValueClass(IntWritable.class); 50 51 job.setMapperClass(Map.class); 52 job.setReducerClass(Reduce.class); 53 54 job.setInputFormatClass(TextInputFormat.class); 55 job.setOutputFormatClass(TextOutputFormat.class); 56 57 FileInputFormat.addInputPath(job, new Path(args[0])); 58 FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 60 job.waitForCompletion(true); 61 } 62 63 }
![Page 35: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/35.jpg)
Spark Production Deployments
![Page 36: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/36.jpg)
Explosion of Specialized Systems
![Page 37: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/37.jpg)
Spark and Berkeley AMP Lab
![Page 38: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/38.jpg)
Benefits of Unified LibrariesOptimizations can be shared between libraries Core Project Tungsten MLlib
Shared statistics libraries Spark Streaming GC and memory management
![Page 39: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/39.jpg)
Mix and match modules
Easily go from DataFrames (SQL) to MLLib / statistics, for example:
scala> import org.apache.spark.mllib.stat.Statistics
scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
![Page 40: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/40.jpg)
Spark Worker FailureRebuild RDD Partitions on Worker from Lineage
![Page 41: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/41.jpg)
Spark SQL & DataFrames
![Page 42: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/42.jpg)
DataFrames & Catalyst Optimizer
![Page 43: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/43.jpg)
Catalyst OptimizationsColumn and partition pruning (Column filters) Predicate pushdowns (Row filters)
![Page 44: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/44.jpg)
Spark SQL Data Sources APIEnables custom data sources to participate in SparkSQL = DataFrames + Catalyst Production Impls spark-csv (Databricks) spark-avro (Databricks) spark-cassandra-connector (DataStax) elasticsearch-hadoop (Elastic.co)
![Page 45: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/45.jpg)
Spark Streaming
![Page 46: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/46.jpg)
Streaming SourcesBasic: Files, Akka actors, queues of RDDs, Socket
Advanced
Kafka
Kinesis
Flume
Twitter firehose
![Page 47: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/47.jpg)
DStreams = micro-batches
![Page 48: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/48.jpg)
Streaming Fault ToleranceIncoming data is replicated to 1 other node Write Ahead Log for sources that support ACKs Checkpointing for recovery if Driver fails
![Page 49: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/49.jpg)
Direct Kafka Streaming: KafkaRDD
No single Receiver Parallelizable No Write Ahead Log Kafka *is* the Write Ahead Log! KafkaRDD stores Kafka offsets KafkaRDD partitions recover from offsets
![Page 50: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/50.jpg)
Spark MLlib & GraphX
![Page 51: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/51.jpg)
Spark MLlib Common AlgosClassifiers DecisionTree, RandomForest
Clustering K-Means, Streaming K-Means
Collaborative Filtering Alternating Least Squares (ALS)
![Page 52: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/52.jpg)
Spark Text Processing AlgosTF/IDF
LDA
Word2Vec
*Pro-Tip: Use Stanford CoreNLP!
![Page 53: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/53.jpg)
Spark ML PipelinesModeled after scikit-learn
![Page 54: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/54.jpg)
Spark GraphX
PageRank Top Influencers
Connected Components Measure of clusters
Triangle Counting Measure of cluster density
![Page 55: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/55.jpg)
Handling State
![Page 56: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/56.jpg)
Message Queue
EventsStream
Processing Layer
State / Database
Happy Users
![Page 57: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/57.jpg)
What Kind of State?
Non-persistent / in-memory: concurrent viewers
Short term: latest trends
Longer term: raw event & aggregate storage
ML Models, predictions, scored data
![Page 58: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/58.jpg)
Spark RDDs
Immutable, cache in memory and/or on disk
Spark Streaming: UpdateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery
![Page 59: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/59.jpg)
•Massively Scalable• High Performance• Always On• Masterless
![Page 60: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/60.jpg)
Scale
Apache Cassandra• Scales Linearly to as many nodes as you need
• Scales whenever you need
![Page 61: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/61.jpg)
Performance
Apache Cassandra• It’s Fast • Built to sustain massive data insertion rates in irregular pattern spikes
![Page 62: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/62.jpg)
FaultTolerance
&Availability
Apache Cassandra• Automatic Replication • Multi Datacenter • Decentralized - no single point of failure • Survive regional outages • New nodes automatically add themselves to the cluster
• DataStax drivers automatically discover new nodes
![Page 63: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/63.jpg)
Architecture
Apache Cassandra• Distributed, Masterless Ring Architecture
• Network Topology Aware
• Flexible, Schemaless - your data
structure can evolve seamlessly over time
![Page 64: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/64.jpg)
To download:
https://cassandra.apache.org/download/
https://github.com/pcmanus/ccm
^ Highly recommended for local testing/cluster setup
![Page 65: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/65.jpg)
Cassandra Data Modeling
Primary key = (partition keys, clustering keys)
Fast queries = fetch single partition
Range scans by clustering key
Must model for query patterns
Clustering 1 Clustering 2 Clustering 3Partition 1Partition 2Partition 3
![Page 66: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/66.jpg)
City Bus Data Modeling Example
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of single bus for a range of time
Can also query most recent location + speed of all buses (slower)
1020 s 1010 s 1000 sBus A speed, GPSBus BBus C
![Page 67: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/67.jpg)
Using Cassandra for Short Term StorageIdea is store and read small values
Idempotent writes + huge write capacity = ideal for streaming ingestion
For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
![Page 68: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/68.jpg)
But Mommy! What about longer term data?
![Page 69: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/69.jpg)
I need to read lots of data, fast!!
- Ad hoc analytics of events - More specialized / geospatial - Building ML models from
large quantities of data - Storing scored/classified data
from models - OLAP / Data Warehousing
![Page 70: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/70.jpg)
Can Cassandra Handle Batch?
Cassandra tables are much better at lots of small reads than big data scans
You CAN store data efficiently in C*
Files seem easier for long term storage and analysis
But are files compatible with streaming?
![Page 71: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/71.jpg)
Lambda Architecture
![Page 72: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/72.jpg)
Lambda is Hard and Expensive
Very high TCO - Many moving parts - KV store, real time, batch
Lots of monitoring, operations, headache
Running similar code in two places
Lower performance - lots of shuffling data, network hops, translating domain objects
Reconcile queries against two different places
![Page 73: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/73.jpg)
NoLambda
A unified system
Real-time processing and reprocessing
No ETLs
Fault tolerance
Everything is a stream
![Page 74: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/74.jpg)
Can Cassandra do batch and ad-hoc?Yes, it can be competitive with Hadoop actually….
If you know how to be creative with storing your data!
Tuplejump/SnackFS - HDFS for Cassandra
github.com/tuplejump/FiloDB - analytics database
Store your data using Protobuf / Avro / etc.
![Page 75: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/75.jpg)
Introduction to FiloDB
Efficient columnar storage - 5-10x better
Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables
Very fine grained filtering for sub-second concurrent queries
Easy BI and ad-hoc analysis via Spark SQL/Dataframes (JDBC etc.)
Uses Cassandra for robust, proven storage
![Page 76: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/76.jpg)
Combining FiloDB + Cassandra
Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event storage
Ad hoc / SQL / BI
Data source for MLLib / building models
Data storage for classified / predicted / scored data
![Page 77: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/77.jpg)
Message Queue
EventsSpark
Streaming
Short term storage, K-V
Adhoc, SQL, ML
Cassandra
FiloDB: Events, ad-hoc, batch
Spark
Dashboards, maps
![Page 78: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/78.jpg)
Message Queue
EventsSpark
Streaming Models
Cassandra
FiloDB: Long term event storage
Spark Learned Data
![Page 79: Building Scalable Data Pipelines - 2016 DataPalooza Seattle](https://reader031.vdocuments.us/reader031/viewer/2022011721/587332191a28ab596c8b6ccf/html5/thumbnails/79.jpg)
FiloDB + CassandraRobust, peer to peer, proven storage platform
Use for short term snapshots, dashboards
Use for efficient long term event storage & ad hoc querying
Use as a source to build detailed models