![Page 1: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/1.jpg)
Wolfram Wingerath
Real-Time Processing Explained
A Survey of Storm, Samza, Spark & Flink
Architecture
![Page 2: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/2.jpg)
About me: Wolfram Wingerath
Distributed Systems Engineer
PhD Thesis & Research:
• Real-Time Databases
• Stream Processing
• NoSQL & Cloud Databases
• …
Practice: Backend-as-a-Service
• Web Caching
• Real-Time Database
• …
www.baqend.com
![Page 3: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/3.jpg)
Who We Are
Wolfram Wingerath, Distributed Systems Engineer
@baqendcom
[email protected]
Our Product
Speed Kit:
• Accelerates Any Website
• Pluggable
• Easy Setup
test.speed-kit.com
Our Services
• Web & Data Management Workshops
• Performance Auditing
• Implementation Services
![Page 4: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/4.jpg)
Outline
• Introduction: Big Data in Motion
  ◦ Big Picture: A Typical Data Pipeline, Processing Frameworks
  ◦ Processing Models: Batch Processing, Stream Processing
  ◦ Streaming Architectures: Lambda Architecture, Kappa Architecture, Typical Operators, Exemplary Use Case
• System Survey: Big Data + Low Latency
• Wrap-Up: Summary & Discussion
• Future Directions: Real-Time Databases
![Page 5: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/5.jpg)
IN PRACTICE
Scalable Data Processing
![Page 6: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/6.jpg)
A Data Processing Pipeline
Pipeline stages: Persistence/Streaming → Processing → Serving → Application
Today's topic!
![Page 7: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/7.jpg)
Data Processing Frameworks
Scale-Out Made Feasible
Data processing frameworks hide the complexities of scaling, e.g.:
• Deployment - code distribution, starting/stopping work
• Monitoring - health checks, application stats
• Scheduling - assigning work, rebalancing
• Fault-tolerance - restarting workers, rescheduling failed work
Figure: scaling out, from running on a single node to running in a cluster
![Page 8: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/8.jpg)
INTRODUCTION
Batch vs Stream Processing
![Page 9: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/9.jpg)
Big Data Processing Frameworks
What are your options?
Figure: frameworks arranged between low latency and high throughput (labels include Amazon Elastic MapReduce and Google Dataflow)
What to use when?
![Page 10: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/10.jpg)
Batch Processing
"Volume"
Pipeline: Persistence (e.g. HDFS) → Batch (e.g. MapReduce) → Serving (e.g. HBase) → Application
• Cost-effective & efficient
• Easy to reason about: operating on complete data
But:
• High latency: jobs run periodically (e.g. at night)
![Page 11: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/11.jpg)
Stream Processing
"Velocity"
Pipeline: Streaming (e.g. Kafka, Redis) → Real-Time (e.g. Storm) → Serving → Application
• Low end-to-end latency
• Challenges:
  ◦ Long-running jobs - no downtime allowed
  ◦ Asynchronism - data may arrive delayed or out-of-order
  ◦ Incomplete input - algorithms operate on partial data
  ◦ More: fault-tolerance, state management, guarantees, …
![Page 12: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/12.jpg)
Lambda Architecture
Batch(D_old) + Stream(D_Δnow) ≈ Batch(D_all)
Pipeline: Persistence → Batch → Serving → Application, alongside Streaming (e.g. Kafka, Redis) → Real-Time → Serving → Application
• Fast output (real-time)
• Data retention + reprocessing (batch)
  → "eventually accurate" merged views of real-time & batch
  Typical setups: Hadoop + Storm (→ Summingbird), Spark, Flink
• High complexity: 2 code bases & 2 deployments
Source: Nathan Marz, How to beat the CAP theorem (2011), http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
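The merge of batch and real-time views can be sketched in a few lines. This is only an illustration under assumptions: the event shape, the cutoff `lastBatchRun`, and the names `batchView`, `speedView` and `mergedView` are hypothetical and not taken from the talk. The batch view covers the old data, the real-time view covers only events since the last batch run, and a query adds both.

```javascript
// Hypothetical page-view events: { url, ts } with ts in seconds.
const events = [
  { url: "/a", ts: 10 }, { url: "/b", ts: 20 },
  { url: "/a", ts: 30 }, { url: "/a", ts: 45 },
];

// Count events per URL that satisfy a predicate.
function countBy(evts, keep) {
  const counts = new Map();
  for (const e of evts) {
    if (keep(e)) counts.set(e.url, (counts.get(e.url) || 0) + 1);
  }
  return counts;
}

// Batch layer: accurate but stale; covers everything before the last batch run.
const lastBatchRun = 25; // assumed cutoff
const batchView = countBy(events, (e) => e.ts < lastBatchRun);

// Real-time layer: covers only the events since the last batch run.
const speedView = countBy(events, (e) => e.ts >= lastBatchRun);

// Serving layer: a query merges both views.
function mergedView(url) {
  return (batchView.get(url) || 0) + (speedView.get(url) || 0);
}

console.log(mergedView("/a"), mergedView("/b"));
```

The two `countBy` calls stand in for the two code bases the slide warns about: in a real setup they would be separate jobs on separate systems.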
![Page 13: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/13.jpg)
Kappa Architecture
Stream(D_all) = Batch(D_all)
Pipeline: Streaming + retention (e.g. Kafka, Kinesis) → Real-Time → Serving → Application, with replay from the retained stream
• Simpler than Lambda Architecture
• Data retention for history
• Reasons against Kappa:
  ◦ Existing legacy batch system
  ◦ Special tools only for a particular batch processor
  ◦ Only incremental algorithms
Source: Jay Kreps, Questioning the Lambda Architecture (2014), https://www.oreilly.com/ideas/questioning-the-lambda-architecture
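A rough sketch of the replay idea: when the stream is retained, reprocessing history is just running a (possibly changed) streaming job over the log again. The in-memory `log` array stands in for a retained Kafka or Kinesis stream, and the job names `totalsV1` and `countsV2` are made up for illustration.

```javascript
// Hypothetical retained log (stands in for a stream with retention).
const log = [
  { offset: 0, user: "ann", amount: 5 },
  { offset: 1, user: "bob", amount: 7 },
  { offset: 2, user: "ann", amount: 3 },
];

// Incremental job, version 1: sum amounts per user.
function totalsV1(state, event) {
  state[event.user] = (state[event.user] || 0) + event.amount;
  return state;
}

// Changed job, version 2: count events per user instead.
function countsV2(state, event) {
  state[event.user] = (state[event.user] || 0) + 1;
  return state;
}

// Run a job by streaming over the log from a given offset.
function run(job, fromOffset = 0) {
  return log.filter((e) => e.offset >= fromOffset).reduce(job, {});
}

const liveView = run(totalsV1);    // regular stream processing
const rebuiltView = run(countsV2); // "batch" run = replay of the same log with new code

console.log(liveView, rebuiltView);
```

Both jobs are purely incremental folds over single events, which reflects the restriction to incremental algorithms listed on the slide.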
![Page 14: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/14.jpg)
Typical Stream Operators
Examples
• Filter & Transform: Filter, Map
• Group: GroupByKey
• Aggregates: SUM(), COUNT()
• Windows: Tumbling, Sliding
Sources: https://www.infoq.com/presentations/stream-processors-databases and https://www.infoq.com/presentations/stream-processing-apache-flink
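To make the two window types concrete, here is a small sketch of how a timestamp could be assigned to tumbling and sliding windows, plus a COUNT() aggregate per tumbling window. The function names and the convention that time starts at 0 are assumptions for illustration, not the API of any particular framework.

```javascript
// Tumbling windows: fixed size, non-overlapping; each timestamp falls into exactly one.
function tumblingWindow(ts, size) {
  const start = Math.floor(ts / size) * size;
  return { start, end: start + size };
}

// Sliding windows: fixed size, advanced by `slide`; a timestamp can fall into several.
function slidingWindows(ts, size, slide) {
  const windows = [];
  const lastStart = Math.floor(ts / slide) * slide;
  for (let start = lastStart; start > ts - size; start -= slide) {
    // Assumption: time starts at 0, so windows with negative start are skipped.
    if (start >= 0) windows.push({ start, end: start + size });
  }
  return windows.reverse();
}

// COUNT() per tumbling window, keyed by window start.
function countPerTumblingWindow(timestamps, size) {
  const counts = new Map();
  for (const ts of timestamps) {
    const { start } = tumblingWindow(ts, size);
    counts.set(start, (counts.get(start) || 0) + 1);
  }
  return counts;
}

console.log(tumblingWindow(25, 10), slidingWindows(25, 20, 10));
```

With size 20 and slide 10, every timestamp (after the first 10 time units) belongs to two overlapping windows, which is why sliding-window aggregates cost more state than tumbling ones.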
![Page 15: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/15.jpg)
Typical Use Case
Example from Yahoo!
1. Input: read ad tracking data from Kafka
2. Filter: discard useless data
3. Project: extract relevant fields
4. Group: by ad campaign
5. Window: ad views per 10-min window
Source: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
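The pipeline stages can be sketched on an in-memory array. Everything specific here is an assumption for illustration: the event fields, the `campaignOf` lookup table, and the array that stands in for the Kafka input; the actual benchmark uses its own schema and a real stream processor.

```javascript
// Hypothetical ad-tracking events (stand-in for the Kafka input).
const adEvents = [
  { type: "view", adId: "ad1", ts: 60000, userAgent: "x" },
  { type: "click", adId: "ad1", ts: 120000, userAgent: "x" },
  { type: "view", adId: "ad2", ts: 300000, userAgent: "y" },
  { type: "view", adId: "ad3", ts: 590000, userAgent: "z" },
  { type: "view", adId: "ad1", ts: 660000, userAgent: "x" },
];

// Assumed static mapping from ad to campaign.
const campaignOf = { ad1: "c1", ad2: "c1", ad3: "c2" };
const WINDOW_MS = 10 * 60 * 1000; // 10-minute tumbling windows

function countViewsPerCampaignWindow(events) {
  const viewCounts = new Map();
  events
    .filter((e) => e.type === "view")                         // Filter: discard useless data
    .map((e) => ({ adId: e.adId, ts: e.ts }))                 // Project: keep relevant fields
    .map((e) => ({ campaign: campaignOf[e.adId], ts: e.ts })) // Group: key by campaign
    .forEach(({ campaign, ts }) => {
      const windowStart = Math.floor(ts / WINDOW_MS) * WINDOW_MS; // Window
      const key = `${campaign}@${windowStart}`;
      viewCounts.set(key, (viewCounts.get(key) || 0) + 1);
    });
  return viewCounts;
}

const result = countViewsPerCampaignWindow(adEvents);
console.log([...result.entries()]);
```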
![Page 16: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/16.jpg)
Wrap-up
Data Processing
• Processing frameworks abstract from scaling issues
• Lambda Architecture: batch + stream processing
• Kappa Architecture: stream-only processing
Batch processing
• easy to reason about
• extremely efficient
• huge input-output latency
Stream processing
• quick results
• purely incremental
• potentially complex to handle
![Page 17: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/17.jpg)
Outline
• Introduction: Big Data in Motion
• System Survey: Big Data + Low Latency
  ◦ Processing Models: Stream ↔ Batch
  ◦ Stream Processing Frameworks: Storm, Trident, Samza, Flink, Other Systems
• Wrap-Up: Summary & Discussion
• Future Directions: Real-Time Databases
![Page 18: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/18.jpg)
SURVEY
Popular Stream Processing Systems
![Page 19: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/19.jpg)
Processing Models
Batch vs. Micro-Batch vs. Stream
Spectrum from low latency to high throughput: stream → micro-batch → batch
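A small sketch of why micro-batching trades latency for throughput: records are collected per interval and handed over together, so each record waits until its batch closes. The record shape and the millisecond interval are assumptions for illustration.

```javascript
// Group timestamped records into micro-batches of a fixed interval (ms).
function toMicroBatches(records, intervalMs) {
  const batches = new Map();
  for (const r of records) {
    const batchId = Math.floor(r.ts / intervalMs);
    if (!batches.has(batchId)) batches.set(batchId, []);
    batches.get(batchId).push(r);
  }
  return batches;
}

// A record waits until its batch closes: added delay = batch end - arrival time.
// With one-at-a-time processing this delay would be 0.
function batchingDelay(ts, intervalMs) {
  return (Math.floor(ts / intervalMs) + 1) * intervalMs - ts;
}

const records = [{ ts: 100 }, { ts: 900 }, { ts: 1100 }];
const microBatches = toMicroBatches(records, 1000);
console.log(microBatches.size, batchingDelay(100, 1000), batchingDelay(900, 1000));
```

Larger intervals amortize per-batch overhead over more records (throughput) but raise the worst-case wait towards the full interval (latency).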
![Page 20: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/20.jpg)
Storm
"Hadoop of real-time"
Overview
◦ First production-ready, well-adopted stream processor
◦ Compatible: native Java API, Thrift, distributed RPC
◦ Low-level: no primitives for joins or aggregations
◦ Native stream processor: latency < 50 ms feasible
◦ Big users: Twitter, Yahoo!, Spotify, Baidu, Alibaba, …
History
◦ 2010: developed at BackType (acquired by Twitter)
◦ 2011: open-sourced
◦ 2014: Apache top-level project
![Page 21: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/21.jpg)
Dataflow
Directed Acyclic Graphs (DAG):
• Spouts: pull data into topology
• Bolts: do processing, emit data
• Asynchronous
• Lineage can be tracked for each tuple
  → At-least-once has 2x messaging overhead
Figure annotation: Cycles!
![Page 22: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/22.jpg)
Cluster Architecture
How Storm Scales
• Submit Topology: topologies are submitted to Nimbus
• Nimbus: scheduling & monitoring
• Zookeeper: handles coordination
• Storm slaves: each runs a Supervisor with several workers (the figure shows two slaves with four workers each)
• Worker: one JVM for each worker (runs spouts and bolts as tasks)
![Page 23: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/23.jpg)
State Management
Recover State on Failure
• In-memory or Redis-backed reliable state
• Synchronous state communication on the critical path
  → infeasible for large state
![Page 24: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/24.jpg)
Back Pressure
Throttling Ingestion on Overload
Approach: monitoring bolts' inbound buffer
1. Exceeding high watermark → throttle!
2. Falling below low watermark → full power!
Figure labels (overload cycle): 1. too many tuples, 2. tuples time out and fail, 3. tuples get replayed
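The two-watermark rule can be sketched as a small hysteresis switch. The class name, the threshold values and the buffer sizes are invented for illustration; this is not Storm's actual implementation or configuration.

```javascript
// Hysteresis throttle driven by the fill level of an inbound buffer.
class BackPressureMonitor {
  constructor(highWatermark, lowWatermark) {
    this.high = highWatermark;
    this.low = lowWatermark;
    this.throttled = false;
  }

  // Takes the current buffer size; returns whether ingestion should be throttled.
  update(bufferSize) {
    if (!this.throttled && bufferSize > this.high) {
      this.throttled = true;  // exceeding high watermark → throttle
    } else if (this.throttled && bufferSize < this.low) {
      this.throttled = false; // falling below low watermark → full power
    }
    return this.throttled;
  }
}

const monitor = new BackPressureMonitor(900, 400); // assumed thresholds
const decisions = [100, 950, 700, 450, 300, 600].map((n) => monitor.update(n));
console.log(decisions);
```

Using two thresholds instead of one keeps the throttle from flapping when the buffer level hovers around a single limit.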
![Page 25: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/25.jpg)
Trident
Stateful Stream Joining on Storm
Overview:
◦ Abstraction layer on top of Storm
◦ Released in 2012 (Storm 0.8.0)
◦ Micro-batching
◦ New features:
  • High-level API: aggregations & joins
  • Strong ordering
  • Stateful exactly-once processing
  • Performance penalty
![Page 26: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/26.jpg)
Trident
Partitioned Micro-Batching
Figure: 3 partitions, 3 batches
Illustration taken from: "Storm applied", Sean T. Allen et al.
![Page 27: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/27.jpg)
Samza
Real-Time on Top of Kafka
Overview
◦ Co-developed with Kafka → Kappa Architecture
◦ Simple: only single-step jobs
◦ Local state
◦ Native stream processor: low latency
◦ Users: LinkedIn, Uber, Netflix, TripAdvisor, Optimizely, …
History
◦ Developed at LinkedIn
◦ 2013: open-source (Apache Incubator)
◦ 2015: Apache top-level project
Illustration taken from: Jay Kreps, Questioning the Lambda Architecture (2014), https://www.oreilly.com/ideas/questioning-the-lambda-architecture (2017-03-02)
![Page 28: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/28.jpg)
Dataflow
Simple By Design
• Job: processing step (≈ Storm bolt)
  → Robust
  → But: often several jobs
• Task: job instance (parallelism)
• Message: single data item
• Output persisted in Kafka
  → Easy data sharing
  → Buffering (no back pressure!)
  → But: increased latency
• Ordering within partitions
• Task = Kafka partitions: not elastic on purpose
Source: Martin Kleppmann, Turning the database inside-out with Apache Samza (2015), https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/ (2017-02-23)
![Page 29: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/29.jpg)
Samza
Local State
Figure: Remote State vs. Local State
Advantages of local state:
• Buffering
  → No back pressure
  → At-least-once delivery
  → Simple recovery
• Fast lookups
Illustrations taken from: Jay Kreps, Why local state is a fundamental primitive in stream processing (2014), https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing (2017-02-26)
![Page 30: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/30.jpg)
Dataflow
Example: Enriching a Clickstream
Example: the enriched clickstream is available to every team within the organization
Illustration taken from: Jay Kreps, Why local state is a fundamental primitive in stream processing (2014), https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing (2017-02-26)
![Page 31: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/31.jpg)
State Management
Straightforward Recovery
Illustration taken from: Navina Ramesh, Apache Samza, LinkedIn's Framework for Stream Processing (2015), https://thenewstack.io/apache-samza-linkedins-framework-for-stream-processing (2017-02-26)
![Page 32: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/32.jpg)
Spark
"MapReduce successor"
Overview
◦ High-level API: immutable collections (RDDs)
◦ Community: 1000+ contributors in 2015
◦ Big users: Amazon, eBay, Yahoo!, IBM, Baidu, …
History
◦ 2009: developed at UC Berkeley
◦ 2010: open-sourced
◦ 2014: Apache top-level project
Components: Core, SQL, MLlib, GraphX, Spark Streaming
![Page 33: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/33.jpg)
Spark Streaming
Overview
◦ High-level API: DStreams (~Java 8 Streams)
◦ Micro-Batching: seconds of latency
◦ Rich features: stateful, exactly-once, elastic
History
◦ 2011: start of development
◦ 2013: Spark Streaming becomes part of Spark Core
![Page 34: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/34.jpg)
Spark Streaming
Core Abstraction: DStream
Resilient Distributed Dataset (RDD)
◦ Immutable collection & deterministic operations
◦ Lineage tracking:
  → state can be reproduced
  → periodic checkpoints reduce recovery time
DStream: Discretized RDD
◦ RDDs are processed in order: no ordering within RDD
◦ RDD scheduling ~50 ms → latency >100 ms
Illustration taken from: http://spark.apache.org/docs/latest/streaming-programming-guide.html#overview (2017-02-26)
![Page 35: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/35.jpg)
Example
Counting Page Views
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
Source: Zaharia, Matei, et al. "Discretized streams: Fault-tolerant streaming computation at scale." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013.
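The pseudocode above can be mimicked in plain JavaScript to show what a running reduce over micro-batches does. This is not Spark code: the `batches` array, `toOnes` and this `runningReduce` helper are hypothetical stand-ins, with each inner array playing the role of one 1-second batch.

```javascript
// Each inner array is one micro-batch of page-view events (made-up data).
const batches = [
  [{ url: "/home" }, { url: "/docs" }],
  [{ url: "/home" }],
  [{ url: "/docs" }, { url: "/docs" }],
];

// map step: event => (url, 1)
const toOnes = (batch) => batch.map((event) => [event.url, 1]);

// runningReduce step: fold each batch into the state carried over from the
// previous batch and emit a new immutable snapshot per batch.
function runningReduce(pairBatches, combine) {
  const snapshots = [];
  let state = {};
  for (const pairs of pairBatches) {
    const next = { ...state }; // never mutate the previous snapshot
    for (const [key, value] of pairs) {
      next[key] = key in next ? combine(next[key], value) : value;
    }
    snapshots.push(next);
    state = next;
  }
  return snapshots;
}

const counts = runningReduce(batches.map(toOnes), (a, b) => a + b);
console.log(counts);
```

Each snapshot depends only on the previous snapshot and the current batch, which mirrors the lineage idea: a lost snapshot can be recomputed deterministically from its inputs.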
![Page 36: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/36.jpg)
Flink
Overview
◦ Native stream processor: latency <100 ms feasible
◦ Abstract API for stream and batch processing, stateful, exactly-once delivery
◦ Many libraries: Table and SQL, CEP, Machine Learning, Gelly, …
◦ Users: Alibaba, Ericsson, Otto Group, ResearchGate, Zalando, …
History
◦ 2010: start as Stratosphere at TU Berlin, HU Berlin, and HPI Potsdam
◦ 2014: Apache Incubator, project renamed to Flink
◦ 2015: Apache top-level project
![Page 37: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/37.jpg)
Architecture
Streaming + Batch
Source: https://www.infoq.com/presentations/stream-processing-apache-flink
![Page 38: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/38.jpg)
Managed State
Streaming + Batch
• Automatic backups of local state
• Stored in RocksDB, savepoints written to HDFS
Source: https://www.infoq.com/presentations/stream-processing-apache-flink
![Page 39: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/39.jpg)
Highlight: Fault Tolerance
Distributed Snapshots
• Ordering within stream partitions
• Periodic checkpoints
• Recovery:
  1. reset state to checkpoint
  2. replay data from there
Exactly-once
Illustration taken from: https://ci.apache.org/projects/flink/flink-docs-release-1.2/internals/stream_checkpointing.html (2017-02-26)
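The recovery recipe can be illustrated with a single operator that keeps a running sum over a replayable input. This sketch deliberately leaves out the distributed part (barriers across operators); the `input` array, the `runFrom` function and its options are invented for illustration and are not Flink API.

```javascript
// Replayable input partition (stand-in for e.g. a Kafka partition); values are made up.
const input = [3, 1, 4, 1, 5, 9, 2, 6];

// Process records starting at a checkpoint; optionally crash before a given offset.
function runFrom(checkpoint, { checkpointEvery, crashAt = Infinity }) {
  let { offset, sum } = checkpoint;
  let lastCheckpoint = checkpoint;
  while (offset < input.length) {
    if (offset === crashAt) return { crashed: true, lastCheckpoint };
    sum += input[offset];
    offset += 1;
    // Periodic checkpoint: state and input position are saved together.
    if (offset % checkpointEvery === 0) lastCheckpoint = { offset, sum };
  }
  return { crashed: false, sum, lastCheckpoint };
}

const start = { offset: 0, sum: 0 };
const firstRun = runFrom(start, { checkpointEvery: 3, crashAt: 5 });

// Recovery: 1. reset state to checkpoint, 2. replay data from there.
const recovered = runFrom(firstRun.lastCheckpoint, { checkpointEvery: 3 });
const failureFree = runFrom(start, { checkpointEvery: 3 });

console.log(firstRun.lastCheckpoint, recovered.sum, failureFree.sum);
```

The records at offsets 3 and 4 are processed twice, yet the final state counts each once, because the state was rolled back together with the input position. That is the sense in which the state is exactly-once.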
![Page 40: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/40.jpg)
Outline
• Introduction: Big Data in Motion
• System Survey: Big Data + Low Latency
• Wrap-Up: Summary & Discussion
  ◦ Comparison Matrix
  ◦ Other Systems
  ◦ One-Line Takeaway
• Future Directions: Real-Time Databases
![Page 41: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/41.jpg)
WRAP UP
Side-by-side comparison
![Page 42: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/42.jpg)
Comparison

| | Storm | Trident | Samza | Spark Streaming | Flink (streaming) |
|---|---|---|---|---|---|
| Strictest guarantee | at-least-once | exactly-once | at-least-once | exactly-once | exactly-once |
| Achievable latency | ≪100 ms | <100 ms | <100 ms | <1 second | <100 ms |
| State management | (small state) | (small state) | | | |
| Processing model | one-at-a-time | micro-batch | one-at-a-time | micro-batch | one-at-a-time |
| Backpressure | | | no (buffering) | | |
| Ordering | | between batches | within partitions | between batches | within partitions |
| Elasticity | | | | | |

Empty cells held check or cross icons on the slide.
![Page 43: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/43.jpg)
Performance
Yahoo! Benchmark
Based on real use case:
◦ Filter and count ad impressions
◦ 10 minute windows
"Storm […] and Flink […] show sub-second latencies at relatively high throughputs with Storm having the lowest 99th percentile latency. Spark streaming […] supports high throughputs, but at a relatively higher latency."
From https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
![Page 44: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/44.jpg)
Other Systems
Heron, Apex, Dataflow, Beam, Kafka Streams, IBM InfoSphere Streams
And even more: Kinesis, Gearpump, MillWheel, Muppet, S4, Photon, …
![Page 45: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/45.jpg)
Outline
• Introduction: Big Data in Motion
• System Survey: Big Data + Low Latency
• Wrap-Up: Summary & Discussion
• Future Directions: Real-Time Databases
  ◦ Real-Time Databases: Why Push-Based Database Queries?, Where Do Real-Time Databases Fit in?, Comparison Matrix (Meteor, RethinkDB, Parse, Firebase, Baqend)
  ◦ Use Cases at Baqend: Query Caching, Real-Time Queries
![Page 46: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/46.jpg)
REAL-TIME DBS
Combining databases with streaming
![Page 47: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/47.jpg)
Traditional Databases
No Request? No Data!
What's the current state?
Figure: a query for circular shapes
Query maintenance: periodic polling
→ Inefficient
→ Slow
![Page 48: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/48.jpg)
Push-Based Access For Evolving Domains
Self-Maintaining Results
Find people in Room B:
SELECT name, x, y FROM People
WHERE x BETWEEN 0 AND 25
AND y BETWEEN 0 AND 15
ORDER BY name ASC
Figure: floor plan with rooms A, B and C on x/y axes; the positions of Wolle, Flo and Erik change over three steps (1. to 3.)
![Page 49: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/49.jpg)
Data Management Overview
DBMS vs. Real-Time DB vs. Stream Processing
• Database Management: static collections, pull-based
• Real-Time Databases: evolving collections, push-based
• Stream Processing: ephemeral streams, push-based
![Page 50: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/50.jpg)
Real-Time Databases
In a Nutshell
Systems compared: Meteor (Poll-and-Diff), Meteor (Oplog Tailing), RethinkDB, Parse, Firebase, Baqend
Criteria (surviving cell annotations in parentheses):
• Scales with write TP
• Scales with no. of queries (?, 100k connections)
• Composite queries (AND/OR) (AND in Firestore)
• Sorted queries (single attribute)
• Limit
• Offset (value-based)
![Page 51: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/51.jpg)
USE CASE
How This Makes the Web Faster
![Page 52: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/52.jpg)
Presentation is loading
![Page 53: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/53.jpg)
Why latency matters
Average: 9.3 s
• 100 ms → -1% Revenue
• 400 ms → -9% Visitors
• 500 ms → -20% Traffic
![Page 54: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/54.jpg)
The Problem
Two Bottlenecks: Backend & Latency
Figure labels: High Latency, Backend
![Page 55: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/55.jpg)
Solution: Global Caching
Fresh Data from Ubiquitous Web Caches
Figure labels: Low Latency, Less Processing
![Page 56: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/56.jpg)
New Caching Algorithms
Solve Consistency Problem
Figure: bit vector illustration
![Page 57: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/57.jpg)
8 Years Research & Development
F. Gessert, F. Bücklers, and N. Ritter, "ORESTES: a Scalable Database-as-a-Service Architecture for Low Latency", in CloudDB 2014, 2014.
F. Gessert and F. Bücklers, "ORESTES: ein System für horizontal skalierbaren Zugriff auf Cloud-Datenbanken", in Informatiktage 2013, 2013.
F. Gessert, M. Schaarschmidt, W. Wingerath, S. Friedrich, and N. Ritter, "The Cache Sketch: Revisiting Expiration-based Caching in the Age of Cloud Data Management", in BTW 2015.
F. Gessert and F. Bücklers, "Performanz- und Reaktivitätssteigerung von OODBMS vermittels der Web-Caching-Hierarchie", Bachelor's thesis, 2010.
F. Gessert and F. Bücklers, "Kohärentes Web-Caching von Datenbankobjekten im Cloud Computing", Master's thesis, 2012.
W. Wingerath, S. Friedrich, and F. Gessert, "Who Watches the Watchmen? On the Lack of Validation in NoSQL Benchmarking", in BTW 2015.
M. Schaarschmidt, F. Gessert, and N. Ritter, "Towards Automated Polyglot Persistence", in BTW 2015.
S. Friedrich, W. Wingerath, F. Gessert, and N. Ritter, "NoSQL OLTP Benchmarking: A Survey", in 44. Jahrestagung der Gesellschaft für Informatik, 2014, vol. 232, pp. 693–704.
F. Gessert, "Skalierbare NoSQL- und Cloud-Datenbanken in Forschung und Praxis", BTW 2015.
F. Gessert and N. Ritter, "Scalable Data Management: NoSQL Data Stores in Research and Practice", 32nd IEEE International Conference on Data Engineering, ICDE, 2016.
W. Wingerath, F. Gessert, S. Friedrich, and N. Ritter, "Real-time stream processing for Big Data", Big Data Analytics it - Information Technology, 2016.
F. Gessert, W. Wingerath, S. Friedrich, and N. Ritter, "NoSQL Database Systems: A Survey and Decision Guidance", Computer Science - Research and Development, 2016.
F. Gessert and N. Ritter, "Polyglot Persistence", Datenbank Spektrum, 2016.
F. Gessert, S. Friedrich, W. Wingerath, M. Schaarschmidt, and N. Ritter, "Towards a Scalable and Unified REST API for Cloud Data Stores", in 44. Jahrestagung der GI, vol. 232, pp. 723–734.
![Page 58: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/58.jpg)
InvaliDB
Invalidating DB Queries
How to detect changes to query results:
"Give me the most popular products that are in stock."
Change types: Add, Change, Remove
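One way to think about the three change types is to compare an object's state before and after a write against the query predicate. The sketch below covers only the filter part of the example query ("in stock"); sorting by popularity and limits are left out, and the names `inStock` and `classify` are hypothetical, not InvaliDB's API.

```javascript
// Hypothetical filter predicate: "products that are in stock".
// A null object means "does not exist" (before an insert or after a delete).
const inStock = (product) => product !== null && product.stock > 0;

// Compare the before- and after-image of a written object against the query.
function classify(matches, before, after) {
  const wasMatch = matches(before);
  const isMatch = matches(after);
  if (!wasMatch && isMatch) return "add";     // object enters the result
  if (wasMatch && !isMatch) return "remove";  // object leaves the result
  if (wasMatch && isMatch) return "change";   // object stays, but was updated
  return "none";                              // write is irrelevant for this query
}

console.log(
  classify(inStock, null, { stock: 3 }),
  classify(inStock, { stock: 3 }, { stock: 1 }),
  classify(inStock, { stock: 1 }, { stock: 0 })
);
```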
![Page 59: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/59.jpg)
Going Real-Time
Query Caching & Subscribing
Figure labels: App Server, Pub-Sub (twice), "Keeps data up-to-date"
![Page 60: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/60.jpg)
InvaliDB
Filter Queries: Distributed Query Matching
Two-dimensional partitioning:
• by Query
• by Object
→ scales with queries and writes
Implementation:
• Apache Storm & Java
• MongoDB query language
• Pluggable engine
Figure labels: Subscription, Write op, Match!
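Two-dimensional partitioning can be pictured as a grid of matching nodes: a subscription is sent to one row, a write to one column, and the node at the intersection checks that query against that object. The grid size, the `hash` function and all function names below are assumptions made for this sketch, not details of the actual implementation.

```javascript
// Grid of matching nodes: rows = query partitions, columns = object partitions.
const QUERY_PARTITIONS = 2;
const OBJECT_PARTITIONS = 3;

// Simple deterministic string hash (assumption; any stable hash would do).
function hash(s) {
  let h = 0;
  for (const ch of s) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

// A subscription goes to every node in its query partition (one row).
function nodesForQuery(queryId) {
  const row = hash(queryId) % QUERY_PARTITIONS;
  return Array.from({ length: OBJECT_PARTITIONS }, (_, col) => `${row}:${col}`);
}

// A write goes to every node in its object partition (one column).
function nodesForWrite(objectId) {
  const col = hash(objectId) % OBJECT_PARTITIONS;
  return Array.from({ length: QUERY_PARTITIONS }, (_, row) => `${row}:${col}`);
}

// Exactly one node sees both a given query and a given object.
function matchingNodes(queryId, objectId) {
  const writeNodes = new Set(nodesForWrite(objectId));
  return nodesForQuery(queryId).filter((n) => writeNodes.has(n));
}

console.log(nodesForQuery("q1"), nodesForWrite("obj42"), matchingNodes("q1", "obj42"));
```

Adding rows spreads more queries over more nodes, and adding columns spreads the write stream, which is one reading of "scales with queries and writes".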
![Page 61: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/61.jpg)
Baqend Real-Time Queries
Low Latency + Linear Scalability
Figure panels: Linear Scalability, Stable Latency Distribution
Source: Quaestor: Query Web Caching for Database-as-a-Service Providers, VLDB '17
![Page 62: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/62.jpg)
Programming Real-Time Queries
JavaScript API
var query = DB.Tweet.find()
  .matches('text', /my filter/)
  .descending('createdAt')
  .offset(20)
  .limit(10);
Static Query:
query.resultList(result => ...);
Real-Time Query:
query.resultStream(result => ...);
![Page 63: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/63.jpg)
![Page 64: Real-Time Processing Explained - speed-kit-website€¦ · Real-Time Processing Explained A Survey of Storm, Samza, Spark & Flink Architecture. Research: •Real-Time Databases •Stream](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec584e324011650077b03bc/html5/thumbnails/64.jpg)
Summary
Stream Processors:
Figure: systems placed between latency and throughput
Real-Time Databases integrate Storage & Streaming
Learn more: slides.baqend.com