3 reasons enterprises struggle with storm & spark streaming - datatorrent


3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS

Deliver fast, actionable business insights for data scientists, rapid application creation for developers, and enterprise-grade operational excellence for IT


www.datatorrent.com


Getting to fast actionable insights means empowering analysts and data scientists to easily work with data from many data sources (both in motion and at rest), gain insights in seconds, visualize the data insights and take action automatically, without the need to involve the entire IT department. At the same time, data center operations teams need to ensure that the solution is operational and meets business SLAs.

Given the buzz around Spark Streaming & Storm, they can seem like obvious choices for supporting streaming analytics. However, most of our customers have struggled to take either Spark Streaming or Storm beyond the proof-of-concept stage, because both address enterprise objectives too narrowly to offer a complete solution. Enterprises require an easy-to-use, visual, tools-based approach that works out of the box. The platform needs to meet the needs of data scientists, developers and data center operations teams without an extensive and expensive patchwork of custom code and third-party software that often fails.

DataTorrent RTS is the industry’s first fully Hadoop-native streaming analytics solution. It provides an enterprise-grade streaming analytics platform, tools and pre-built analytics modules, and “lights out” data center operational capabilities.

This paper explores the top 3 reasons enterprises pass on Spark Streaming & Storm and deploy DataTorrent RTS.



1. Enterprise-grade streaming analytics platform

Your streaming analytics platform needs to meet the needs of your business. It’s not sufficient to take open source code that might work for some large web-scale organizations with scores of platform-level developers and try to deploy it in an enterprise data center. Most enterprises don’t have or want developers who code at the platform level. Imagine having your developers struggle with tuple-level acking, configuration and distributed state management! Enterprises have strict business requirements for SLAs (no data loss, performance/latency and availability), and they want their developers to focus on solving their core business problem.

With this goal in mind, DataTorrent RTS was built from day one as a Hadoop 2.x native application. DataTorrent RTS natively supports Hadoop YARN and HDFS on every commercial Hadoop platform. It also runs seamlessly in public or private cloud environments. IT organizations get the benefits of high performance, in-order processing, auto-scaling, dynamic updates, distributed in-memory analytics, and automatic fault tolerance for application state, engine state and raw data, without having to hand-code any of these capabilities.

An enormous amount of data is generated each day, in different varieties, at different sizes and at different rates. This fast big data is critical to an organization’s ability to gain competitive advantage and achieve operational efficiencies. It’s important that your streaming analytics solution not only handles the different data types but also provides appropriate processing guarantees. DataTorrent RTS is the only streaming analytics solution that can provide exactly-once, at-most-once and at-least-once event processing guarantees while still achieving the low latency of per-tuple processing and not resorting to micro-batching.
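To make the difference between the three processing guarantees concrete, here is a minimal, language-neutral sketch in Python. It is not the DataTorrent RTS API: the function `run`, the `(event_id, value)` event shape and the single simulated crash are assumptions made purely for illustration, and the dedupe set is assumed to survive the crash.

```python
def run(events, mode, crash_on_id):
    """Feed events to a sink, simulating one crash between 'apply' and 'ack'.

    events      -- list of (event_id, value) pairs (hypothetical shape)
    mode        -- "at-most-once", "at-least-once" or "exactly-once"
    crash_on_id -- the event during which the single crash happens
    Returns the list of values the sink actually applied, in order.
    """
    applied = []       # effects visible at the sink
    seen_ids = set()   # dedupe state, assumed durable across the crash
    crashed = False
    i = 0              # index of the next unacknowledged event
    while i < len(events):
        event_id, value = events[i]
        crash_now = event_id == crash_on_id and not crashed
        if mode == "at-most-once":
            i += 1                 # acknowledge first: never resent
            if crash_now:
                crashed = True     # crash before the effect: event is lost
                continue
            applied.append(value)
        else:
            if mode == "exactly-once" and event_id in seen_ids:
                i += 1             # replayed duplicate: acknowledge and skip
                continue
            applied.append(value)  # apply the effect first
            seen_ids.add(event_id)
            if crash_now:
                crashed = True     # crash before the ack: event is replayed
                continue
            i += 1
    return applied


events = [(1, 10), (2, 20), (3, 30)]
print(run(events, "at-most-once", crash_on_id=2))   # [10, 30]
print(run(events, "at-least-once", crash_on_id=2))  # [10, 20, 20, 30]
print(run(events, "exactly-once", crash_on_id=2))   # [10, 20, 30]
```

At-most-once can lose an event, at-least-once can duplicate one, and exactly-once needs extra state (here, a set of seen IDs) to make the replay harmless. The sketch handles one event at a time, which is the per-tuple model the paragraph above contrasts with micro-batching.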

The decisions made on the basis of insights gained from fast big data are typically in an operational data path, so enterprise-grade fault tolerance is required for those insights to be operational. DataTorrent RTS provides fault tolerance for raw input data (even when the input source is not stateful), engine state and processed data (application state), all without human intervention in the event of an outage. Also, only DataTorrent RTS supports ‘incremental recovery’, which allows a failed node to recover its state and raw data stream from the previous node rather than requiring replay from the first step. This significantly reduces recovery time and ensures latency SLAs are maintained.
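The idea behind recovering from the previous node rather than from the first step can be sketched as follows. This is a simplified illustration, not how DataTorrent RTS is implemented: the classes `CountingStage` and `UpstreamBuffer` and the in-memory checkpoint are hypothetical stand-ins (a real platform would persist checkpoints durably, for example in HDFS).

```python
class CountingStage:
    """Hypothetical downstream operator whose state is a running sum."""

    def __init__(self):
        self.total = 0
        self._saved = 0            # stand-in for a durable checkpoint

    def process(self, value):
        self.total += value

    def checkpoint(self):
        self._saved = self.total

    def restore(self):
        self.total = self._saved


class UpstreamBuffer:
    """Hypothetical upstream node that keeps the tuples it has emitted
    since the downstream stage's last checkpoint."""

    def __init__(self):
        self.pending = []

    def emit(self, value, downstream):
        self.pending.append(value)
        downstream.process(value)

    def on_downstream_checkpoint(self):
        self.pending.clear()       # older tuples are covered by the checkpoint

    def replay(self, downstream):
        for value in self.pending:
            downstream.process(value)
        return len(self.pending)


def demo():
    upstream, stage = UpstreamBuffer(), CountingStage()
    for value in (1, 2):
        upstream.emit(value, stage)
    stage.checkpoint()             # state 3 is now safe
    upstream.on_downstream_checkpoint()
    for value in (3, 4):
        upstream.emit(value, stage)
    stage.total = -1               # simulate losing in-memory state in a crash
    stage.restore()                # back to the checkpointed value 3
    replayed = upstream.replay(stage)
    return stage.total, replayed


print(demo())  # (10, 2)
```

Only the two tuples emitted after the checkpoint are replayed, not all four from the original source, which is where the shorter recovery time comes from.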

Where Storm & Spark Streaming fall short

The applicability of Apache Storm & Spark Streaming is limited by their core architecture. With Spark Streaming, the inherent RDD-based processing paradigm adds overhead and latency to stream processing. The per-tuple acking in Storm is notoriously problematic in production environments and creates severe operational headaches when scaling a topology or troubleshooting bottlenecks & failures. Both Storm & Spark Streaming force users to ‘micro-batch’ input to provide exactly-once processing guarantees, which introduces significant processing latency. Also, the ability to maintain event order or provide application-state-level fault tolerance is not part of the core platform in either Spark Streaming or Storm. These are critical components of a stream processing platform and a must-have for most use cases (e.g., imagine trying to do event-sequence-based pattern detection). Implementing them requires non-trivial programming and an intricate understanding of the underlying streaming platform & concepts, plus constant maintenance and updates with each release of the platform. Finally, all the ‘workarounds’ you have to build into your business logic create significant lock-in for your application.



What to ask

To ensure an enterprise-grade solution that meets your organization’s SLA requirements, ask the following questions of your proposed solution:

• If Hadoop is your core big data platform, does your streaming platform seamlessly use HDFS for raw data, application state checkpoints and engine state management, reducing dependence on external datastores such as relational databases that do not scale? Also, does your streaming platform run natively on YARN for scheduling, so that you do not have to make the platform’s own scheduler coexist with YARN, which can cause significant multi-tenancy & operational issues?

• Can the streaming analytics solution auto-scale and process increased data loads without manual programming and re-deployment?

• Does the streaming analytics platform guarantee the processing order of your events across all processing guarantees (at-most-once, at-least-once & exactly-once) without having to micro-batch the input data?

• Is the streaming analytics solution’s fault tolerance complete (raw events, app state & engine state), abstracted from the developer and done natively in Hadoop using HDFS?

• Streaming analytics applications need to be able to handle events non-stop. Does your streaming analytics solution support dynamic updates to application properties and business logic with no application downtime?

2. Data scientist and application developer friendly

The path to a production-ready streaming analytics solution entails a lot of experimentation up front. Data scientists and developers should be able to use intuitive visual tools to quickly create streaming applications and iterate over their hypotheses. These iterations should not always involve cumbersome coding by developers. Developers should be able to simply create organization-specific business logic (e.g. custom parsers) for any data source and make it available for data scientists to visually assemble the streaming application.

The DataTorrent RTS streaming analytics solution enables rapid time to market/time to value via pre-built modular analytics capabilities that are easily combined using a visual interface. Development is simple with a single-threaded, Java-based development model that allows for arbitrary business logic (often re-using existing code!). To get your developers productive in no time, DataTorrent RTS provides over 450 pre-built Java operators that offer a raft of analytical capabilities. More than 75 input and output operators allow for data ingestion and distribution from sources such as Kafka, Flume, message buses (JMS, MQ, etc.), databases (SQL, NoSQL), web sockets and more. All the platform processing guarantees, idempotency & state management are automatically extended to the input & output connectors and all other operators, so no additional platform-level development work is needed from the application developer.

The Java operator programming model is simple yet powerful, as DataTorrent RTS provides key capabilities that are left up to the developer in open source streaming analytics platforms. Developers do not have to worry about multi-threading the code; the application is automatically partitioned and distributed across the Hadoop cluster for scalability. Another key capability is native support for application time-series windows that are both aggregate (per minute, per hour) and rolling (last 5 minutes, last 3 hours). As mentioned earlier, fault tolerance is a platform capability and abstracted from the developer.
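The two kinds of window mentioned above can be illustrated with a short Python sketch. This is not the DataTorrent RTS operator API; the function names and the `(timestamp_seconds, value)` event shape are assumptions chosen for illustration.

```python
from collections import defaultdict


def aggregate_sums(events, width):
    """Aggregate (fixed, non-overlapping) windows: sum values per window.

    events -- iterable of (timestamp_seconds, value) pairs (hypothetical shape)
    width  -- window width in seconds, e.g. 60 for "per minute"
    Returns {window_start_seconds: sum_of_values}.
    """
    sums = defaultdict(int)
    for ts, value in events:
        window_start = (ts // width) * width
        sums[window_start] += value
    return dict(sums)


def rolling_sum(events, now, span):
    """Rolling window: sum of values with now - span < timestamp <= now."""
    return sum(value for ts, value in events if now - span < ts <= now)


events = [(5, 1), (30, 2), (65, 3), (130, 4)]
print(aggregate_sums(events, width=60))      # {0: 3, 60: 3, 120: 4}
print(rolling_sum(events, now=130, span=120))  # 9
```

An aggregate window closes once per interval and each event belongs to exactly one window, while a rolling window is re-evaluated as time advances and the same event can contribute to many successive results. A platform that provides both natively spares the developer from managing this bookkeeping and its state by hand.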



Where Storm & Spark Streaming fall short

The Java API in Spark Streaming & Storm requires a lot of hand coding, as there is no library of pre-built code, and data input & output connectors are few. The Java interface in Spark Streaming is notoriously hard to use because of a significant bias towards Scala. With Storm, even though Java is supported, developers have to deal with tuple-level acking in their application code. Besides the lack of a starting point, programming for both Spark Streaming & Storm is tedious: the developer must manually account for scalability, handle input data skew, hand-code fault tolerance for the application data and attempt to force event ordering/re-ordering. Spark Streaming & Storm do not have any visual development tools, so all coding must be done by a developer, and a data scientist who is not familiar with streaming cannot create simple applications to quickly iterate over their analysis.

What to ask

To ensure that data scientists and developers can rapidly assemble applications, ask the following questions of your proposed solution:

• Does the streaming analytics solution have connectors to support fault-tolerant & auto-scaling data ingestion & distribution for all of your data sources & analytics destinations “out of the box”?

• Are common data analytics capabilities such as joins, aggregations, and statistical analysis available out-of-the-box? How about complex capabilities such as dimensional cube creations and integration with machine learning tools?

• Does the solution aggregate data over varying windows, both static and rolling, automatically, or does the developer have to implement this manually?

• Is the solution data scientist and business analyst friendly, with visual application creation and data visualization tools?

3. Robust management and operational deployment

Fast big data doesn’t stop, and neither can the insights your business gains and the actions it takes. As a result, streaming analytics applications are designed to run 24x7 with no downtime. Data center operations teams need to ensure that the full lifecycle of application deployment, monitoring, updating, and problem resolution meets the organization’s business commitments. Management requirements extend not only to on-premises deployments, but also to cloud and hybrid cloud/data center deployments.

Designed from day one with enterprise data center operations as a requirement, DataTorrent RTS fully embraces the application lifecycle. The DataTorrent solution is fully multi-tenant, allowing multiple applications to run on the same Hadoop cluster, which optimizes operations and maximizes use of data center resources.

DataTorrent RTS provides a simple-to-implement, simple-to-use application packaging technology to streamline the handoff from dev to ops. Because the platform is designed for zero downtime, data center ops teams can change business logic, modify application window sizes (for example, from 1 hour to 30 minutes) and performance-tune a running application without stopping the data processing.

The DataTorrent RTS UI console provides full visibility into the application at the Hadoop container level, including resource usage and performance/latency statistics, in addition to built-in monitoring alerts. Application issue resolution is simplified with application counters, console event alerts and cluster-wide log collection and consolidation.



Where Storm & Spark Streaming fall short

Spark Streaming & Storm provide rudimentary capabilities across the application lifecycle. Their management & monitoring tools do not provide full visibility into all metrics of the streaming application and the infrastructure. Neither architecture makes provision for dynamic application updates.

What to ask

• Does your organization require easy-to-use tools for the full application deployment & management operations cycle?

• Are visual, automated alerting and command line tools required for your data center operations team?

• Does the streaming analytics solution have built-in capabilities to make application modifications dynamically?



Conclusion

Enterprises are seeing greater opportunity to better serve their customers, drive greater revenues and reduce costs through operational efficiencies. In order to capitalize on the opportunity, organizations are looking for solutions that enable rapid insights and action to be taken on fast big data. An enterprise-grade solution is required that meets the needs of data scientists, developers and data center operations.

The top 3 reasons that enterprises are deploying DataTorrent RTS over Spark Streaming & Storm are summarized below.

Enterprise-grade streaming analytics platform

• Industry’s first Hadoop-native, fully multi-tenant YARN and HDFS based architecture

• No data loss with automatic fault tolerance for raw event data, application state & engine state

• High-throughput, in-memory & low-latency event processing with no need to micro-batch

• At-most-once, at-least-once and exactly-once processing guarantees, all with guaranteed event order

• Auto-scaling & auto-partitioning of event streams for skew management

Data scientist & application developer friendly

• Visual application creation tool that utilizes the 450+ open source Java operators

• Ability to ingest data from and distribute to any source with more than 75 pre-built adaptors

• Open source library of 450+ operators for a wide variety of real-time analytics & transformations

Robust operations & management platform

• Simple application packaging and deployment

• Intuitive UI for end-to-end management, monitoring, reporting & troubleshooting

• Dynamic application updates with no application downtime

• Light footprint (no need to deploy on every Hadoop node) for simple installation & upgrade

• REST API for easy integration with enterprise tools

Additional Resources

DataTorrent RTS: Data sheet

DataTorrent RTS Whitepaper

DataTorrent download

DataTorrent Inc., 3200 Patrick Henry Drive 2nd Floor Santa Clara CA 95054 +(1) 408-331-5034, ext #101
