Making Big Data Analytics Interactive and Real-Time
Spark: Making Big Data Analytics Interactive and Real-Time
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Timothy Hunter, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
spark-project.org
UC BERKELEY
Overview
Spark is a parallel framework that provides:
» Efficient primitives for in-memory data sharing
» Simple APIs in Scala, Java, SQL
» High generality (applicable to many emerging problems)
This talk will cover:
» What it does
» How people are using it (including some surprises)
» Current research
Motivation
MapReduce simplified data analysis on large, unreliable clusters
But as soon as it got popular, users wanted more:
» More complex, multi-pass applications (e.g. machine learning, graph algorithms)
» More interactive ad-hoc queries
» More real-time stream processing
One reaction: specialized models for some of these apps (e.g. Pregel, Storm)
Motivation
Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks:
Efficient primitives for data sharing
Examples
[Diagram: an iterative job reads input from HDFS and writes to and re-reads HDFS between each iteration; separate ad-hoc queries each re-read the same input from HDFS to produce their results]
Slow due to replication and disk I/O, but necessary for fault tolerance
Goal: Sharing at Memory Speed
[Diagram: the same iterative job and queries, now sharing the input in memory after one-time processing]
10-100× faster than network/disk, but how to get fault tolerance (FT)?
Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?
Existing Systems
Existing in-memory storage systems have interfaces based on fine-grained updates
» Reads and writes to cells in a table
» E.g. databases, key-value stores, distributed memory
Requires replicating data or logs across nodes for fault tolerance ⇒ expensive!
» 10-100× slower than memory write
Solution: Resilient Distributed Datasets (RDDs)
Provide an interface based on coarse-grained operations (map, group-by, join, …)
Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails
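The lineage idea can be sketched in a few lines of plain Python (a hypothetical toy, not Spark's implementation): each dataset records only the coarse-grained operation and the parent that produced it, so a lost partition is recomputed on demand rather than restored from a replica.

```python
# Toy sketch of lineage-based recovery (hypothetical, not Spark's real code).
# Each dataset records the coarse-grained operation that produced it, so any
# lost partition can be recomputed from its parent instead of replicated.

class LineageDataset:
    def __init__(self, parent, op):
        self.parent = parent          # dataset this one was derived from
        self.op = op                  # function applied to every element
        self.partitions = {}          # cached partition data; may be lost

    def compute(self, part_id):
        """Return a partition, recomputing it from lineage if it was lost."""
        if part_id not in self.partitions:
            parent_data = self.parent.compute(part_id)
            self.partitions[part_id] = [self.op(x) for x in parent_data]
        return self.partitions[part_id]

class InputDataset:
    def __init__(self, parts):
        self.parts = parts            # stored reliably (think: HDFS)

    def compute(self, part_id):
        return self.parts[part_id]

# Build a one-step lineage, simulate losing a partition, then recover it.
inp = InputDataset({0: [1, 2, 3], 1: [4, 5, 6]})
doubled = LineageDataset(inp, lambda x: 2 * x)
doubled.compute(0)                    # materialize partition 0
doubled.partitions.pop(0)             # simulate node failure / eviction
print(doubled.compute(0))             # recomputed from lineage: [2, 4, 6]
```

Note the key property from the slide: if nothing fails, the only cost is storing the small lineage record, not copying data.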
RDD Recovery
[Diagram: the earlier iterative-job and query patterns, with lost in-memory data recovered by recomputation from the input]
Generality of RDDs
RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items
Capture many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: Pregel, iterative MapReduce, PowerGraph, …
Allow these models to be composed
Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
Spark Programming Interface
Language-integrated API in Scala*
Provides:
» Resilient distributed datasets (RDDs)
  • Partitioned collections with controllable caching
» Operations on RDDs
  • Transformations (define RDDs), actions (compute results)
» Restricted shared variables (broadcast, accumulators)
*Also Java, SQL, and soon Python
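The transformation/action split can be illustrated with a toy lazy collection in plain Python (hypothetical, not Spark's API): transformations only record what to do, and nothing executes until an action such as count() is called.

```python
# Toy illustration of lazy transformations vs. actions (not Spark itself):
# map/filter only record steps; count() actually runs the pipeline.

class LazyCollection:
    def __init__(self, data, steps=()):
        self.data = data
        self.steps = steps            # recorded transformations, not yet run

    def map(self, f):                 # transformation: returns a description
        return LazyCollection(self.data, self.steps + (("map", f),))

    def filter(self, p):              # transformation: also just a description
        return LazyCollection(self.data, self.steps + (("filter", p),))

    def count(self):                  # action: evaluates the recorded pipeline
        items = iter(self.data)
        for kind, f in self.steps:
            items = map(f, items) if kind == "map" else filter(f, items)
        return sum(1 for _ in items)

lines = LazyCollection(["ERROR a", "INFO b", "ERROR c"])
errors = lines.filter(lambda s: s.startswith("ERROR"))   # nothing runs yet
print(errors.count())                                    # prints 2
```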
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.persist()

cachedMsgs.filter(_.contains("foo")).count    // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads an HDFS block, caches its partition of the messages RDD, and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
Fault Recovery Results
[Chart: iteration time (s) across 10 iterations: 119 s for the first iteration, 56-58 s for normal iterations, rising to 81 s on the iteration where the failure happens, then back to 57-59 s]
Example: Logistic Regression
Goal: find best line separating two sets of points
[Diagram: '+' and '–' points in the plane, with a random initial line iteratively moved toward the target separating line]
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).persist()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
Logistic Regression Performance
[Chart: running time (min) vs number of iterations (1-30) for Hadoop and Spark. Hadoop: 110 s/iteration; Spark: 80 s for the first iteration, 6 s for further iterations]
Example: Collaborative Filtering
Goal: predict users' movie ratings based on past ratings of other movies
[Diagram: ratings matrix R (movies × users) with known entries (1-5) and many unknowns marked '?']
Model and Algorithm
Model R as the product of user and movie feature matrices A and B, of size U×K and M×K
Alternating Least Squares (ALS)
» Start with random A & B
» Optimize user vectors (A) based on movies
» Optimize movie vectors (B) based on users
» Repeat until converged
R ≈ A · Bᵀ
Serial ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors
for (i <- 1 to ITERATIONS) {
  A = (0 until U).map(i => updateUser(i, B, R))
  B = (0 until M).map(i => updateMovie(i, A, R))
}

(0 until U) and (0 until M) are Range objects
Naïve Spark ALS

var R = readRatingsMatrix(...)
var A = // array of U random vectors
var B = // array of M random vectors
for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R))
           .collect()
}

Problem: R is re-sent to all nodes in each iteration
Efficient Spark ALS

var R = spark.broadcast(readRatingsMatrix(...))
var A = // array of U random vectors
var B = // array of M random vectors
for (i <- 1 to ITERATIONS) {
  A = spark.parallelize(0 until U, numSlices)
           .map(i => updateUser(i, B, R.value))
           .collect()
  B = spark.parallelize(0 until M, numSlices)
           .map(i => updateMovie(i, A, R.value))
           .collect()
}

Solution: mark R as a broadcast variable
Result: 3× performance improvement
Scaling Up Broadcast
[Charts: iteration time (s), split into communication and computation, vs number of machines (10, 30, 60, 90), for the initial HDFS-based broadcast and for Cornet P2P broadcast]
[Chowdhury et al, SIGCOMM 2011]
Other RDD Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, …
Actions (return a result to driver program): collect, reduce, count, save, …
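As a plain-Python illustration (not Spark's implementation) of what one of these transformations means, reduceByKey groups pairs by key and folds each group's values with the supplied function:

```python
# Plain-Python sketch of reduceByKey semantics (not Spark itself):
# group (key, value) pairs by key, then fold each group's values with f.

from collections import defaultdict

def reduce_by_key(pairs, f):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)          # group values by key
    result = {}
    for k, vs in groups.items():
        acc = vs[0]
        for v in vs[1:]:
            acc = f(acc, v)          # fold the group with f
        result[k] = acc
    return result

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(reduce_by_key(pairs, lambda x, y: x + y))   # {'a': 4, 'b': 2}
```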
Spark in Java
Scala: lines.filter(_.contains("error")).count()
Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
Spark in Python (Coming Soon!)

lines = sc.textFile(sys.argv[1])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)
Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
Spark Users
400+ user meetup, 20+ contributors
User Applications
Crowdsourced traffic estimation (Mobile Millennium)
Video analytics & anomaly detection (Conviva)
Ad-hoc queries from web app (Quantifind)
Twitter spam classification (Monarch)
DNA sequence analysis (SNAP)
…
Mobile Millennium Project
Estimate city traffic from GPS-equipped vehicles (e.g. SF taxis)
Sample Data
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
Challenge
Data is noisy and sparse (1 sample/minute)
Must infer path taken by each vehicle in addition to travel time distribution on each link
Solution
EM algorithm to estimate paths and travel time distributions simultaneously
[Dataflow: observations → (flatMap) weighted path samples → (groupByKey) link parameters → (broadcast) back to the observations step]
Results
3× speedup from caching, 4.5× from broadcast
[Hunter et al, SOCC 2011]
Conviva GeoReport
SQL aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated I/O, deserialization and filtering
[Chart: response time: Spark 0.5 hours vs Hive 20 hours]
Other Programming Models
Pregel on Spark (Bagel)
» 200 lines of code
Iterative MapReduce
» 200 lines of code
Hive on Spark (Shark)
» 5000 lines of code
» Compatible with Apache Hive
» Machine learning ops. in Scala
Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
Implementation
Runs on Apache Mesos cluster manager to coexist w/ Hadoop
Supports any Hadoop storage system (HDFS, HBase, …)
[Diagram: Spark, Hadoop, and MPI frameworks running side by side on Mesos across cluster nodes]
Easy local mode and EC2 launch scripts
No changes to Scala
Task Scheduler
Runs general DAGs
Pipelines functions within a stage
Cache-aware data reuse & locality
Partitioning-aware to avoid shuffles
[Diagram: a DAG of RDDs A-G connected by map, union, groupBy, and join, divided into three stages; cached data partitions let stages be skipped]
Language Integration
Scala closures are Serializable Java objects
» Serialize on master, load & run on workers
Not quite enough:
» Nested closures may reference the entire outer scope, pulling in non-Serializable variables not used inside
» Solution: bytecode analysis + reflection
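The closure-capture problem can be seen in plain Python too (a hypothetical analogue of the Scala issue, not Spark's closure cleaner): a nested closure over an object keeps the entire object reachable, while copying out the one field it needs keeps the shipped capture minimal.

```python
# Plain-Python analogue (hypothetical) of the closure-capture problem:
# a closure over `obj` keeps the whole object alive, while extracting the
# one needed field keeps the capture down to a single small value.

class Analysis:
    def __init__(self):
        self.threshold = 10
        self.big_buffer = bytearray(10**6)    # large, irrelevant to the task

def naive_task(obj):
    # The closure cell holds the entire Analysis object, buffer and all.
    return lambda x: x > obj.threshold

def clean_task(obj):
    threshold = obj.threshold                 # copy out just the needed value
    return lambda x: x > threshold            # closure now holds only an int

a = Analysis()
f, g = naive_task(a), clean_task(a)
print(type(f.__closure__[0].cell_contents).__name__)  # Analysis
print(type(g.__closure__[0].cell_contents).__name__)  # int
print(f(11), g(11))                                   # True True
```

Spark's actual fix works at the bytecode level, rewriting the closure object so only referenced fields are serialized; the principle of shrinking the capture is the same.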
Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line
» Track variables that each line depends on
» Ship generated classes to workers
Enables in-memory exploration of big data
Outline
Programming interface
Examples
User applications
Implementation
Demo
Current research: Spark Streaming
Motivation
Many “big data” apps need to work in real time
» Site statistics, spam filtering, intrusion detection, …
To scale to 100s of nodes, need:
» Fault-tolerance: for both crashes and stragglers
» Efficiency: don't consume many resources beyond base processing
This is challenging in existing streaming systems
Traditional Streaming Systems
Continuous processing model
» Each node has long-lived state
» For each record, update state & send new records
[Diagram: input records pushed through nodes 1-3, each holding mutable state]
Traditional Streaming Systems
Fault tolerance via replication or upstream backup:
[Diagram: replication runs synchronized copies (nodes 1'-3') of every node; upstream backup keeps one standby node and replays input to it on failure]
[The same diagram, annotated: replication gives fast recovery but 2× hardware cost; upstream backup needs only one standby but is slow to recover]
[Borealis, Flux] [Hwang et al, 2005]
Neither approach can handle stragglers
Observation
Batch processing models, such as MapReduce, do provide fault tolerance efficiently
» Divide job into deterministic tasks
» Rerun failed/slow tasks in parallel on other nodes
Idea: run streaming computations as a series of small, deterministic batch jobs
» Same recovery schemes at a much smaller timescale
» To make latency low, store state in RDDs
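The idea can be sketched in plain Python (hypothetical, not Spark Streaming's API): cut the input into short intervals and run each one as a small deterministic batch job whose output state is an immutable snapshot.

```python
# Sketch of the discretized-stream idea (hypothetical, not Spark Streaming):
# each interval's input becomes a batch, and each batch job deterministically
# maps (previous state, batch) -> new state, never mutating old snapshots.

from collections import Counter

def run_interval(prev_counts, batch):
    """One deterministic micro-batch: old state + new records -> new state."""
    new_counts = Counter(prev_counts)        # state is copied, never mutated
    for url in batch:
        new_counts[url] += 1
    return new_counts                        # immutable snapshot of the state

stream = [["a", "b"], ["a"], ["b", "b", "a"]]   # records arriving per interval
state = Counter()
snapshots = []
for batch in stream:
    state = run_interval(state, batch)
    snapshots.append(state)                  # keep each interval's state

print(dict(state))                           # {'a': 3, 'b': 3}
# Because run_interval is deterministic, any lost snapshot can be rebuilt by
# rerunning it from the previous snapshot and that interval's input.
```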
Discretized Stream Processing
[Diagram: at each interval t = 1, 2, …, input pulled during the interval forms an immutable dataset (stored reliably); a batch operation on it produces an immutable output/state dataset, stored as an RDD]
Fault Recovery
Checkpoint state RDDs periodically
If a node fails/straggles, rebuild lost RDD partitions in parallel on other nodes
[Diagram: a map from an input dataset to an output dataset, with lost partitions recomputed in parallel]
Faster recovery than upstream backup, without the cost of replication
How Fast Can It Go?
Can process over 60M records/s (6 GB/s) on 100 nodes at sub-second latency
[Charts: max throughput (millions of records/s) vs nodes in cluster (up to 100), under 1 s and 2 s latency targets, for Grep and Top K Words]
Comparison with Storm
Storm limited to 100K records/s/node
Also tried S4: 10K records/s/node
Commercial systems: O(500K) total
[Charts: throughput (MB/s per node) vs record size (100-1000 bytes) for Spark and Storm, on Grep and Top K Words]
These systems lack Spark's FT guarantees
How Fast Can It Recover?
Recovers from faults/stragglers within 1 second
Sliding WordCount on 20 nodes with 10 s checkpoint interval
[Chart: interval processing time (s) over time (0-25 s), with the point where the failure happens marked]
Programming Interface
Extension to Spark: Spark Streaming
» All Spark operators plus new "stateful" ones

// Running count of pageviews by URL
views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

[Diagram: at each interval t = 1, 2, …, the views, ones, and counts "discretized streams" (D-streams) are sequences of RDDs and partitions, connected by map and reduce]
Incremental Operators

words.reduceByWindow("5s", max)
[Diagram: each interval's max over intervals t-1 … t+4, combined by max into a sliding max]
Requires an associative function

words.reduceByWindow("5s", _ + _, _ - _)
[Diagram: sliding counts updated incrementally by adding the newest interval's counts and subtracting the oldest's]
Requires an associative & invertible function
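The incremental (associative and invertible) sliding reduce can be sketched in plain Python (hypothetical): each new window result is the previous result plus the newest interval minus the interval that dropped out, instead of recombining all intervals in the window.

```python
# Sketch of the incremental sliding reduce (hypothetical, not Spark's code):
# with an invertible function like +/-, each window result is computed as
# previous result + newest interval - oldest departing interval.

def sliding_counts(interval_counts, window):
    results = []
    total = 0
    for t, c in enumerate(interval_counts):
        total += c                               # add the newest interval
        if t >= window:
            total -= interval_counts[t - window] # subtract the one that left
        results.append(total)
    return results

per_interval = [3, 1, 4, 1, 5, 9]                # counts produced per interval
print(sliding_counts(per_interval, window=3))    # [3, 4, 8, 6, 10, 15]
```

For a non-invertible function like max, this shortcut is unavailable, which is why the slide distinguishes the two reduceByWindow variants.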
Applications
Conviva video dashboard (>50 session-level metrics)
[Chart: active video sessions (millions) vs nodes in cluster (up to 64)]
Mobile Millennium traffic estimation (online EM algorithm)
[Chart: GPS observations/s vs nodes in cluster (up to 80)]
Unifying Streaming and Batch
D-streams and RDDs can seamlessly be combined
» Same execution and fault recovery models
Enables powerful features:
» Combining streams with historical data:
  pageViews.join(historicCounts).map(...)
» Interactive ad-hoc queries on stream state:
  pageViews.slice("21:00", "21:05").topK(10)
(used in the Mobile Millennium and Conviva applications)
Benefits of a Unified Stack
Write each algorithm only once
Reuse data across streaming & batch jobs
Query stream state instead of waiting for import
Some users were doing this manually!
» Conviva anomaly detection, Quantifind dashboard
Conclusion
"Big data" is moving beyond one-pass batch jobs, to low-latency apps that need data sharing
RDDs offer fault-tolerant sharing at memory speed
Spark uses them to combine streaming, batch & interactive analytics in one system
www.spark-project.org
Related Work
DryadLINQ, FlumeJava
» Similar "distributed collection" API, but cannot reuse datasets efficiently across queries
GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes requiring replication or checkpoints
Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern
Relational databases
» Lineage/provenance, logical logging, materialized views
Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached
Behavior with Not Enough RAM
[Chart: iteration time (s) vs % of working set in memory: 68.8 s with cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached]
RDDs for Debugging
Debugging general distributed apps is very hard
However, Spark and other recent frameworks run deterministic tasks for fault tolerance
Leverage this determinism for debugging:
» Log lineage for all RDDs created (small)
» Let user replay any task in jdb, rebuild any RDD to query it interactively, or check assertions