Benchmark: Bananas vs. Spark Streaming
Copyright © 2016, AKUDA Labs
Abstract
Bananas is the fastest data stream processing system currently in production use and is
capable of remarkable throughputs at sub-millisecond latencies. Bananas was
developed by AKUDA Labs to solve the real-time pattern matching and event detection
problem and has performed exceptionally well since its inception. To quantify the
orders of magnitude by which Bananas outperforms current streaming systems, a
suite of benchmarking tests was conducted to compare the performance of Bananas
against Spark Streaming. The results were significant for two main reasons: first, the
near-ideal consistency of Bananas’ low latency at high throughput across varying
processing workloads; and second, the inconsistency between Spark Streaming’s
claims of high throughput at low latency and its measured behavior. Probably the most
unexpected aspect was the amount of performance tuning necessary to make Spark
Streaming perform as claimed; even then, it proved nearly impossible to obtain high
throughput at sub-second latency. The overall results, the benchmark setup, a system
architecture comparison, a brief explanation of the differences in performance, an
overview of Bananas, and plans for future tests are presented in the following
sections.
Overall Benchmark Results
Defying common industry ‘wisdom’, Bananas displays effectively unchanged latencies at
increasing throughputs. Spark Streaming proved to be non-competitive: it struggled
to maintain steady-state performance with bounded processing latencies at a fraction of
the throughput that Bananas handles with sub-millisecond latencies. As is evident from
Figure 1, Bananas’ latency remains consistently under a millisecond at increasing
throughput across different processing workloads. Spark Streaming, by contrast, is a far
less reassuring example of a modern high-throughput streaming system: it exhibits
astoundingly high latencies at relatively low throughputs and is unable to harness the
full processing power of multicore systems.
Figure 1: Throughput vs. Average Latency curves for Bananas and Spark Streaming.
(Inset) Bananas-specific plot with latency in microseconds.
The observed differences in latency between Bananas and Spark Streaming are
noteworthy. For a throughput of approximately 100,000 packets/s, the latency of Spark
Streaming was more than 24,000 times that of Bananas. When latency bounds were
set at 3s for Spark Streaming—low enough to emulate real-time behavior and high
enough to ensure acceptable throughputs—the observed throughput for Bananas was
10 times that of Spark Streaming at sub-millisecond latencies, as shown in Figure 2.
Figure 2: Throughput comparison between Bananas, Spark Streaming optimized for
latency, and Spark Streaming with a latency bound of 3s.
Benchmark Setup
The benchmark setup was designed to emulate a common real-world problem:
detection of specific patterns within streaming unstructured text data. Systems
designed for spam detection, log file event detection, and social media trends
monitoring need to solve this problem in real-time. The primary task for the systems was
to search for specific patterns within a large text repository, which for this benchmark
was chosen as the Complete Works of Shakespeare. The dataset was obtained from
Project Gutenberg and was minimally processed to remove meta-information, copyright
instructions and other such text that was not written by The Bard himself. The cleaned
dataset contained 111,396 lines, with the average line having 37 Unicode characters.
Forty-eight parallel text-processing pipelines were initialized, each searching for one of
five different patterns; if a pattern was found, the text packet was forwarded to the
next stage. The benchmarks were conducted on four 16-core virtual machines, all Dell
815 systems with 16GB of RAM and 10Gbit connections. The setups for Bananas and
Spark are shown in Figure 3, along with the specific points where latencies were
calculated.
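As an illustration of the per-packet task, the following sketch (in Scala, the language
used for all examples in this report) shows the kind of pattern filter each pipeline runs.
The five literal patterns and the names used here are hypothetical; the actual patterns
used in the benchmark are not published.

    // Hypothetical sketch of the per-packet filtering task used in the benchmark.
    // The five patterns below are illustrative; the actual patterns are not published.
    object PatternFilter {
      val patterns: Seq[String] = Seq("king", "crown", "dagger", "ghost", "storm")

      // A packet is forwarded to the next stage only if it contains a pattern.
      def matches(line: String): Boolean = patterns.exists(line.contains)
    }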
For Spark Streaming, 12 Kafka topic partitions were set up to distribute data among the
four nodes. The receiver-based approach was used for Spark Streaming, as it proved to
be more stable and flexible than the receiver-less approach. The incoming packets were
tagged with a timestamp upon arrival at the Kafka receiver. To optimize for latency,
block window sizes were varied from 50ms to 800ms, and eventually the default of
200ms was found to work best. After processing was completed, the data packets were
again tagged with a timestamp at the output end of the executor processes.
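A minimal sketch of this receiver-based setup, written against the Spark 1.x / Kafka 0.8
API that was current in 2016, is shown below. It reuses the hypothetical PatternFilter
above; the topic name, ZooKeeper address, group id, and the exact timestamping
points are assumptions, since the benchmark’s instrumentation code is not published.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object SparkBenchmarkSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("pattern-benchmark")
          .set("spark.streaming.blockInterval", "200ms") // block window found to work best

        val ssc = new StreamingContext(conf, Seconds(1)) // batch interval, tuned separately

        // Receiver-based Kafka stream; 12 consumer threads for the 12 topic partitions.
        val lines = KafkaUtils
          .createStream(ssc, "zk-host:2181", "bench-group", Map("shakespeare" -> 12))
          .map { case (_, line) => (System.currentTimeMillis, line) } // ingress timestamp

        // Filter each packet against the patterns, then stamp it again on egress.
        lines
          .filter { case (_, line) => PatternFilter.matches(line) }
          .map { case (t0, line) => (t0, System.currentTimeMillis, line) }
          .foreachRDD(rdd => rdd.foreach { case (t0, t1, _) => println(t1 - t0) }) // latency sample

        ssc.start()
        ssc.awaitTermination()
      }
    }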
Figure 3: Spark Streaming setup and Latency measurement endpoints
Bananas has a fundamentally different architecture from that of Spark Streaming;
therefore, to ensure an equivalent comparison, the setup shown in Figure 4 was
implemented for Bananas. Data packets were initially tagged with a timestamp upon
arrival at the Data Producer and, after passing through the Data Channel, Data
Classifiers, and Output Channel, were tagged with a timestamp a final time at the
aggregator end.
Figure 4: Bananas test setup and Latency endpoints
System Architecture Comparison
Bananas and Spark Streaming have two fundamental architectural differences that
create the observed differences in throughputs and latencies:
• Bananas processes each packet upon arrival, whereas Spark Streaming collects the
data received over a specified batch time window into microbatches and processes
each batch as a whole.
• Bananas is implemented through a lockless shared-memory queue management
protocol that enables massively parallel processing pipelines; a generic sketch of this
kind of queue follows below. Spark Streaming’s inherent parallelism lies in dividing
the batch RDDs into block time-window partitions, each of which is then processed
in parallel. This setup requires state management, extensive synchronization, and
data distribution and aggregation processes that add latency.
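To make the lockless-queue idea concrete, below is a generic single-producer/
single-consumer ring buffer of the textbook kind. It is not Bananas’ actual queue
protocol, which is not published, but it illustrates how two threads can hand packets
off without locks by pairing plain reads with release-ordered index updates.

    import java.util.concurrent.atomic.AtomicLong

    // Generic lock-free single-producer/single-consumer ring buffer.
    // Illustrative only: not Bananas' actual queue protocol.
    final class SpscRing[T <: AnyRef](capacity: Int) { // capacity must be a power of two
      private val mask = capacity - 1
      private val buf  = new Array[AnyRef](capacity)
      private val head = new AtomicLong(0) // next slot the consumer will read
      private val tail = new AtomicLong(0) // next slot the producer will write

      def offer(item: T): Boolean = {
        val t = tail.get
        if (t - head.get == capacity) return false     // full; caller retries, no lock taken
        buf((t & mask).toInt) = item
        tail.lazySet(t + 1)                            // release-store publishes the item
        true
      }

      def poll(): T = {
        val h = head.get
        if (h == tail.get) return null.asInstanceOf[T] // empty
        val item = buf((h & mask).toInt).asInstanceOf[T]
        head.lazySet(h + 1)                            // release-store frees the slot
        item
      }
    }

In a pipeline built this way, one such queue sits between each pair of stages, so a
packet moves from stage to stage without a single lock acquisition, which is what
makes near-linear speedups across threads possible.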
Explanation of Results and Insights
The biggest performance drawback for Spark Streaming is that minimizing latency to
sub-second levels required tolerating unacceptably low throughputs, on the order of a
few hundred packets per second, along with scheduling delays of as much as a minute
to maintain system stability. Attaining throughput on a scale comparable to Bananas
required tolerating much greater latencies. Neither of these tradeoffs was required for
Bananas. Essentially, Spark Streaming showed latencies of tens of seconds, comparable
to the Spark batch processing system, in order to sustain throughputs comparable to
Bananas without any backpressure. Improving the throughput of the Spark Streaming
setup required extensive performance tuning: selecting appropriate batch and block
intervals, finding an optimal executor topology, varying throughput rates to avoid
observed backpressure, and modifying the number of Kafka partitions (the principal
knobs are sketched below). The need for such extensive tuning indicates that Spark
Streaming is unsuitable for most high-performance real-time data stream processing
systems that require consistent performance under varying data traffic and changing
topologies. Conversely, Bananas delivers thoroughly consistent performance, ideal for
real-world requirements.
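For concreteness, the sketch below gathers those tuning knobs as they would appear
in a Spark 1.x configuration. Every value is an example standing in for the tuning
described; the only settings the text actually states are the 200ms block interval and
the 12 Kafka partitions.

    import org.apache.spark.SparkConf

    object TuningSketch {
      // Illustrative Spark 1.x tuning surface; values are examples, not the
      // benchmark's settings except where the text states them.
      val tunedConf: SparkConf = new SparkConf()
        .set("spark.streaming.blockInterval", "200ms")        // block interval (default worked best)
        .set("spark.streaming.backpressure.enabled", "false") // input rate varied manually instead
        .set("spark.executor.instances", "4")                 // executor topology: e.g. one per node
        .set("spark.executor.cores", "16")                    // cores per executor
      // The batch interval is chosen when the StreamingContext is created, and the
      // number of Kafka topic partitions (12 in this benchmark) is set on the Kafka side.
    }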
Bananas Overview
While Bananas works in a cluster across a configurable set of nodes, much like Spark
Streaming, it requires no expensive crosstalk between nodes. The nodes use lockless
shared memory data structures, careful manipulation of processor caches, and
read/write ordering management to handle internal communication and filtering, and
the system overall employs a lightweight scheduler to load balance classifiers across the
nodes. Each node sends metrics of its own workload and processor utilization to the
scheduler, which uses the information to manage resource allocation and classifier
topology across the nodes.
Bananas relies on the use of the following techniques to meet its design requirements:
• Data flow pipelines with zero data replication
• Use of the high bisection bandwidth inter-core communication fabric inside a CPU
• Lockless multi-threaded data management and data processing algorithms. Typical
use of locks and/or semaphores for any type of synchronization, consistency
enforcement, or communication makes true linear speedups (increase of
performance versus number of threads) impossible.
• Fast, adaptive, predictive, multi-level topology and resource scheduler (sub-second re-organization period).
Bananas leverages the communication fabric inside IA64 processors to produce high
bandwidth replication by bringing a document into the second-level processor cache,
and then maximizing the probability of satisfying requests from all other threads on all
other cores from the neighbor cache. The use of this type of “super-node” allows
Bananas to flatten a broadcasting tree, reducing both the cost and latency by a factor of
approximately 100x, and in some scenarios, over 1000x. (Performance gain factor
depends on exact type of classification algorithms being executed, partitioning
granularity, and actual hardware.) Bananas was designed with high throughput as the
primary goal; this ensured low latency and a greater proportion of CPU cycles
dedicated to data processing instead of message passing and inter-node
synchronization operations. These design and architectural decisions enable Bananas
to effectively maintain constant latencies with increasing throughputs. Figure 5 shows
the throughput vs. average latency curves for Bananas with 48 and 96 data classifiers
working on the same 4-node cluster.
Figure 5: Bananas Throughput vs. Latency for a processing load of 48 and 96 classifiers
on the same test setup.
As the workload on each Bananas worker is doubled, the observed latencies naturally
increase; remarkably, in the case of Bananas they increase by only a factor of two,
demonstrating almost linearly scalable performance. Also noteworthy is the effectively
constant latency across increasing throughput.
Conclusions
• Bananas defies common industry wisdom and processes data at consistently high
throughput and low latency, both important criteria for a streaming system to meet
current and future data processing requirements.
• Spark Streaming is essentially an abstraction over the Spark batch processing system
and is unsuitable for practical streaming systems that require high throughput while
performing computationally intensive tasks at sub-second latencies.
• Our results showed that a truly real-time system can never be one that batches data
and processes it in slices. Not only is significant time spent scheduling tasks, but
batching also carries an inherent risk of backpressure and the inflexibility of having to
modify time windows to withstand temporal variations in data traffic.
Future Tests
These tests demonstrate the overall performance advantage of Bananas over Spark
Streaming in a real use-case scenario. There are several other possible tests that would
verify the versatility and adaptability of Bananas under a variety of processing
workloads. A subset of these tests will be performed in the future, including but not
limited to:
• Comparison with other popular open-source streaming solutions, including Apache
Storm, Apache Flink, Apache Tez, Apache Samza, Apache Apex, and Google Cloud
Dataflow
• Scalability tests over multiple servers with both greater and fewer cores than in the
current test case
• Machine Learning tasks involving significant CPU-intensive matrix calculations and
data aggregation
• Word-count, K-Top words, Grep, and other such test cases used to benchmark
Spark Streaming in published papers and reports.