Benchmark: Bananas vs. Spark Streaming
Copyright © 2016, AKUDA Labs
Abstract
Bananas is the fastest data stream processing system currently in production use and is
capable of remarkable throughputs at sub-millisecond latencies. Bananas was
developed by AKUDA Labs to solve the real-time pattern matching and event detection
problem and has performed exceptionally well since its inception. To quantify the
orders of magnitude by which Bananas outperforms current streaming systems, a
suite of benchmarking tests was conducted to compare the performance of Bananas
against Spark Streaming. The results were significant for two main reasons: first, the
near-ideal consistency of Bananas’ low latency at high throughput across varying
processing workloads; and second, the inconsistency between Spark Streaming’s
claims of high throughput at low latency and its measured behavior. Probably the most
unexpected aspect was the amount of performance tuning necessary to make Spark
Streaming perform as claimed; even then, it proved nearly impossible to obtain high
throughput at sub-second latency. The overall results, the benchmark setup, a system
architecture comparison, a brief explanation of the differences in performance, an
overview of Bananas, and plans for future tests are presented in the following
sections.
Overall Benchmark Results
Defying common industry ‘wisdom’, Bananas displays effectively unchanged latencies at
increasing throughputs. Spark Streaming proved to be non-competitive: it struggled
to maintain steady-state performance with bounded processing latencies at a fraction of
the throughput that Bananas handles with sub-millisecond latencies. As is evident from
Figure 1, Bananas’ latency remains consistently under a millisecond at increasing
throughput across different processing workloads. Spark Streaming, by contrast, is a far
less reassuring example of a modern high-throughput streaming system: it exhibits
astoundingly high latencies at relatively low throughputs and is unable to harness the
full processing power of multicore systems.
Figure 1: Throughput vs. Average Latency curves for Bananas and Spark Streaming.
(Inset) Bananas-specific plot with latency in microseconds.
The observed differences in latency between Bananas and Spark Streaming are
noteworthy. For a throughput of approximately 100,000 packets/s, the latency of Spark
Streaming was more than 24,000 times that of Bananas. When latency bounds were
set at 3s for Spark Streaming—low enough to emulate real-time behavior and high
enough to ensure acceptable throughputs—the observed throughput for Bananas was
10 times that of Spark Streaming at sub-millisecond latencies, as shown in Figure 2.
Figure 2: Throughput comparison between Bananas, Spark Streaming optimized for
latency, and Spark Streaming with a latency bound of 3s.
Benchmark Setup
The benchmark setup was designed to emulate a common real-world problem:
detection of specific patterns within streaming unstructured text data. Systems
designed for spam detection, log file event detection, and social media trends
monitoring need to solve this problem in real-time. The primary task for the systems was
to search for specific patterns within a large text repository, which for this benchmark
was chosen as the Complete Works of Shakespeare. The dataset was obtained from
Project Gutenberg and was minimally processed to remove meta-information, copyright
instructions and other such text that was not written by The Bard himself. The cleaned
dataset contained 111,396 lines, with the average line having 37 Unicode characters.
Forty-eight parallel text-processing pipelines were initialized, each searching for one of
five different patterns; if a pattern was found, the text packet was forwarded to the
next stage. The benchmarks were conducted on four 16-core virtual machines, all Dell
815 systems with 16GB of RAM and 10Gbit connections. The setups for Bananas and
Spark are shown in Figure 3, along with the specific points where latencies were
calculated.
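As an illustration of the per-packet task, the following sketch (in Scala, the language
used for all examples in this report) shows the kind of pattern filter each pipeline runs.
The five literal patterns and the names used here are hypothetical; the actual patterns
used in the benchmark are not published.

    // Hypothetical sketch of the per-packet filtering task used in the benchmark.
    // The five patterns below are illustrative; the actual patterns are not published.
    object PatternFilter {
      val patterns: Seq[String] = Seq("king", "crown", "dagger", "ghost", "storm")

      // A packet is forwarded to the next stage only if it contains a pattern.
      def matches(line: String): Boolean = patterns.exists(line.contains)
    }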
For Spark Streaming, 12 Kafka topic partitions were set up to distribute data among the
four nodes. The receiver-based approach was used for Spark Streaming, as it proved to
be more stable and flexible than the receiver-less approach. The incoming packets were
tagged with a timestamp upon arrival at the Kafka receiver. To optimize for latency,
block window sizes were varied from 50ms to 800ms, and eventually the default of
200ms was found to work best. After processing was completed, the data packets were
again tagged with a timestamp at the output end of the executor processes.
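A minimal sketch of this receiver-based setup, written against the Spark 1.x / Kafka 0.8
API that was current in 2016, is shown below. It reuses the hypothetical PatternFilter
above; the topic name, ZooKeeper address, group id, and the exact timestamping
points are assumptions, since the benchmark’s instrumentation code is not published.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object SparkBenchmarkSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("pattern-benchmark")
          .set("spark.streaming.blockInterval", "200ms") // block window found to work best

        val ssc = new StreamingContext(conf, Seconds(1)) // batch interval, tuned separately

        // Receiver-based Kafka stream; 12 consumer threads for the 12 topic partitions.
        val lines = KafkaUtils
          .createStream(ssc, "zk-host:2181", "bench-group", Map("shakespeare" -> 12))
          .map { case (_, line) => (System.currentTimeMillis, line) } // ingress timestamp

        // Filter each packet against the patterns, then stamp it again on egress.
        lines
          .filter { case (_, line) => PatternFilter.matches(line) }
          .map { case (t0, line) => (t0, System.currentTimeMillis, line) }
          .foreachRDD(rdd => rdd.foreach { case (t0, t1, _) => println(t1 - t0) }) // latency sample

        ssc.start()
        ssc.awaitTermination()
      }
    }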
Figure 3: Spark Streaming setup and Latency measurement endpoints
Bananas has a fundamentally different architecture from that of Spark Streaming;
therefore, to ensure an equivalent comparison, the setup shown in Figure 4 was
implemented for Bananas. Data packets were initially tagged with a timestamp upon
arrival at the Data Producer and, after passing through the Data Channel, Data
Classifiers, and Output Channel, were tagged with a timestamp a final time at the
aggregator end.
Figure 4: Bananas test setup and Latency endpoints
System Architecture Comparison
Bananas and Spark Streaming have two fundamental architectural differences that
create the observed differences in throughputs and latencies:
• Bananas processes each packet upon arrival, whereas Spark Streaming collects the
data received over a specified batch time window into microbatches and processes
each batch as a whole.
• Bananas is implemented through a lockless shared-memory queue management
protocol that enables massively parallel processing pipelines; a generic sketch of this
kind of queue follows below. Spark Streaming’s inherent parallelism lies in dividing
the batch RDDs into block time-window partitions, each of which is then processed
in parallel. This setup requires state management, extensive synchronization, and
data distribution and aggregation processes that add latency.
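To make the lockless-queue idea concrete, below is a generic single-producer/
single-consumer ring buffer of the textbook kind. It is not Bananas’ actual queue
protocol, which is not published, but it illustrates how two threads can hand packets
off without locks by pairing plain reads with release-ordered index updates.

    import java.util.concurrent.atomic.AtomicLong

    // Generic lock-free single-producer/single-consumer ring buffer.
    // Illustrative only: not Bananas' actual queue protocol.
    final class SpscRing[T <: AnyRef](capacity: Int) { // capacity must be a power of two
      private val mask = capacity - 1
      private val buf  = new Array[AnyRef](capacity)
      private val head = new AtomicLong(0) // next slot the consumer will read
      private val tail = new AtomicLong(0) // next slot the producer will write

      def offer(item: T): Boolean = {
        val t = tail.get
        if (t - head.get == capacity) return false     // full; caller retries, no lock taken
        buf((t & mask).toInt) = item
        tail.lazySet(t + 1)                            // release-store publishes the item
        true
      }

      def poll(): T = {
        val h = head.get
        if (h == tail.get) return null.asInstanceOf[T] // empty
        val item = buf((h & mask).toInt).asInstanceOf[T]
        head.lazySet(h + 1)                            // release-store frees the slot
        item
      }
    }

In a pipeline built this way, one such queue sits between each pair of stages, so a
packet moves from stage to stage without a single lock acquisition, which is what
makes near-linear speedups across threads possible.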
Explanation of Results and Insights
The biggest performance drawback for Spark Streaming is that minimizing latency to
sub-second levels required tolerating unacceptably low throughputs, on the order of a
few hundred packets per second, along with scheduling delays of as much as a minute
to maintain system stability. Attaining throughput on a scale comparable to Bananas
required tolerating much greater latencies. Neither of these tradeoffs was required for
Bananas. Essentially, Spark Streaming showed latencies of tens of seconds, comparable
to the Spark batch processing system, in order to sustain throughputs comparable to
Bananas without any backpressure. Improving the throughput of the Spark Streaming
setup required extensive performance tuning: selecting appropriate batch and block
intervals, finding an optimal executor topology, varying throughput rates to avoid
observed backpressure, and modifying the number of Kafka partitions (the principal
knobs are sketched below). The need for such extensive tuning indicates that Spark
Streaming is unsuitable for most high-performance real-time data stream processing
systems that require consistent performance under varying data traffic and changing
topologies. Conversely, Bananas delivers thoroughly consistent performance, ideal for
real-world requirements.
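For concreteness, the sketch below gathers those tuning knobs as they would appear
in a Spark 1.x configuration. Every value is an example standing in for the tuning
described; the only settings the text actually states are the 200ms block interval and
the 12 Kafka partitions.

    import org.apache.spark.SparkConf

    object TuningSketch {
      // Illustrative Spark 1.x tuning surface; values are examples, not the
      // benchmark's settings except where the text states them.
      val tunedConf: SparkConf = new SparkConf()
        .set("spark.streaming.blockInterval", "200ms")        // block interval (default worked best)
        .set("spark.streaming.backpressure.enabled", "false") // input rate varied manually instead
        .set("spark.executor.instances", "4")                 // executor topology: e.g. one per node
        .set("spark.executor.cores", "16")                    // cores per executor
      // The batch interval is chosen when the StreamingContext is created, and the
      // number of Kafka topic partitions (12 in this benchmark) is set on the Kafka side.
    }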
Bananas Overview
While Bananas works in a cluster across a configurable set of nodes, much like Spark
Streaming, it requires no expensive crosstalk between nodes. The nodes use lockless
shared memory data structures, careful manipulation of processor caches, and
read/write ordering management to handle internal communication and filtering, and
the system overall employs a lightweight scheduler to load balance classifiers across the
nodes. Each node sends metrics of its own workload and processor utilization to the
scheduler, which uses the information to manage resource allocation and classifier
topology across the nodes.
Bananas relies on the use of the following techniques to meet its design requirements:
• Data flow pipelines with zero data replication
• Use of the high bisection bandwidth inter-core communication fabric inside a CPU
• Lockless multi-threaded data management and data processing algorithms. Typical
use of locks and/or semaphores for any type of synchronization, consistency
enforcement, or communication makes true linear speedups (increase of
performance versus number of threads) impossible.
• Fast, adaptive, predictive, multi-level topology and resource scheduler (sub-second re-organization period).
Bananas leverages the communication fabric inside IA64 processors to produce high
bandwidth replication by bringing a document into the second-level processor cache,
and then maximizing the probability of satisfying requests from all other threads on all
other cores from the neighbor cache. The use of this type of “super-node” allows
Bananas to flatten a broadcasting tree, reducing both the cost and latency by a factor of
approximately 100x, and in some scenarios, over 1000x. (Performance gain factor
depends on exact type of classification algorithms being executed, partitioning
granularity, and actual hardware.) Bananas was designed with high throughput as the
primary goal; this ensured low latency and a greater proportion of CPU cycles
dedicated to data processing instead of message passing and inter-node
synchronization operations. These design and architectural decisions enable Bananas
to effectively maintain constant latencies with increasing throughputs. Figure 5 shows
the throughput vs. average latency curves for Bananas with 48 and 96 data classifiers
working on the same 4-node cluster.
Figure 5: Bananas Throughput vs. Latency for a processing load of 48 and 96 classifiers
on the same test setup.
As the workload on each Bananas worker is doubled, the observed latencies naturally
increase; remarkably, in the case of Bananas they increase by only a factor of two,
demonstrating almost linearly scalable performance. Also noteworthy is the effectively
constant latency across increasing throughput.
Conclusions
• Bananas defies common industry wisdom and processes data at consistently high
throughput and low latency, both important criteria for a streaming system to meet
current and future data processing requirements.
• Spark Streaming is essentially an abstraction over the Spark batch processing system
and is unsuitable for practical streaming systems that require high throughput while
performing computationally intensive tasks at sub-second latencies.
• Our results showed that a truly real-time system can never be one that batches data
and processes it in slices. Not only is significant time spent scheduling tasks, but
batching also carries an inherent risk of backpressure and the inflexibility of having to
modify time windows to withstand temporal variations in data traffic.
Future Tests
These tests demonstrate the overall performance advantage of Bananas over Spark
Streaming in a real use-case scenario. There are several other possible tests that would
verify the versatility and adaptability of Bananas under a variety of processing
workloads. A subset of these tests will be performed in the future, including but not
limited to:
• Comparison with other popular open-source streaming solutions, including Apache
Storm, Apache Flink, Apache Tez, Apache Samza, Apache Apex, and Google Cloud
Dataflow
• Scalability tests over multiple servers with both greater and fewer cores than in the
current test case
• Machine Learning tasks involving significant CPU-intensive matrix calculations and
data aggregation
• Word-count, K-Top words, Grep, and other such test cases used to benchmark
Spark Streaming in published papers and reports.