
The Bloor Group

WHITE PAPER

WHEN STREAMING BECOMES STRATEGIC

MapR Provides an Architectural Foundation for Data-Driven Enterprises

Robin Bloor, Ph.D. & Rebecca Jozwiak



A New System of Record?

The system of record feeds the collection of applications that run the business, such as an ERP system. Traditionally such systems passed information via ETL software to data warehouse systems, which in turn fed data marts for the sake of BI and analytics applications. The defining point is that the system of record holds the source data of the organization – the golden record of the “truth.” Often different contributing applications did not agree perfectly on the definitions of data, so between the system of record and the data warehouse sat master data management (MDM) software to reconcile such differences and provide business-level data definitions to help users understand the corporate data universe.

For years, that’s how it was. It worked reasonably well until new technologies and influences forced their way into the corporate IT environment. You can think of these as falling into two categories:

• The growth of data

• The acceleration of data flows

Data didn’t suddenly become “big.” On average, corporate data has been growing at roughly 60 percent per year for decades, and it continues to do so. Some of this growth can be attributed to new applications, particularly web-facing applications and mobile applications. Some is from external sources such as social media or public data feeds. Some is data we never bothered to retain or process, such as that in log files. Some comes from office systems (email, instant messaging, etc.).
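As a rough illustration of what sustained growth of that order implies, the following Python sketch compounds the 60 percent figure. The function names are ours, and the rate is simply the estimate quoted above rather than a measurement.

```python
import math

ANNUAL_GROWTH = 0.60  # the rough yearly growth rate quoted in the text


def growth_factor(years: float, rate: float = ANNUAL_GROWTH) -> float:
    """Return how many times larger the data volume is after `years`."""
    return (1.0 + rate) ** years


def doubling_time(rate: float = ANNUAL_GROWTH) -> float:
    """Return the number of years needed for the data volume to double."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time():.2f} years")
    print(f"Growth over 10 years: {growth_factor(10):.0f}x")
```

Under that assumption, volumes double roughly every year and a half and grow more than a hundredfold in a decade.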

Because of these varied sources and formats, a good deal of data did not arrive in a conveniently structured form. Unstructured though some of it was, if it recorded or influenced decision making, then it clearly qualified as part of the system of record. Hadoop and its attendant software ecosystem emerged at the right time to qualify as a possible repository for such data.

The Data Flow Issue

While data continued to grow like bamboo, the second disruptive factor came into play. Although we described this as the acceleration of data flows, we could have described it as the trend to data streaming. The old system of record and data warehouse arrangement was founded on the idea that data lived in specific locations, and where there was a need, it was replicated to other locations, including the data warehouse.

This data architecture came under pressure with the advent of the internet, which mandated 24 x 7 web-facing systems. These applications, in turn, impacted corporate applications via web services and a service-oriented architecture, forcing some of them to run 24 x 7. This engendered first-generation messaging systems (such as ESBs) to service the flow of data and messages between programs. Gradually, the time windows during which data could be moved became shorter and shorter, provoking software changes and often more hardware to speed up the process.

It has become increasingly apparent that a streaming (i.e., data flow-oriented) architecture is necessary. Such an architecture can and should be seen as distinct from the streaming systems that were introduced quite early on in the financial services sector for automated trading. Although these were streaming architectures of a kind, they were focused on a particular application rather than built to provide a general purpose streaming environment.



Streaming Architecture

What defines a streaming architecture is that it focuses equally on data flow and data storage. It could be described as an event-driven architecture in the sense that the data that it presides over includes events: website events, application events, customer events, social media events, analytics triggers, sensor data, log file events, and so on. In the past decade, we have steadily moved from a transactional world to an event-based world. The system of record now needs to include the events that determine the behavior of the business, whether those events lead to immediate action or are simply stored for later analysis.

Even businesses that are not real-time, in the sense of needing to process data as soon as it arrives, need to think in terms of a streaming architecture. Ultimately, batches of data are collections of events and should be processed event-by-event if possible. Operating in this way makes it possible to reduce latencies without any need to make fundamental changes to the software architecture. Either there is adequate capacity in the IT environment, or you deal with latency simply with the judicious addition of hardware resources. Older transactional software architectures are unable to reduce latencies so easily and may even get bent out of shape just by increases in data volumes.
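The point that a batch is simply a collection of events can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: events are modelled as plain dictionaries, and `handle_event`, `process_stream` and `process_batch` are hypothetical names, not part of any product.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]  # assumption: an event is a plain dictionary


def handle_event(event: Event) -> Event:
    """Process one event; here we merely tag it as processed."""
    return {**event, "processed": True}


def process_stream(
    events: Iterable[Event],
    handler: Callable[[Event], Event] = handle_event,
) -> Iterator[Event]:
    """Apply the handler to each event as it arrives, yielding results lazily."""
    for event in events:
        yield handler(event)


def process_batch(batch: List[Event]) -> List[Event]:
    """A batch is a finite collection of events, so reuse the streaming path."""
    return list(process_stream(batch))
```

Because the batch path reuses the per-event handler, moving from nightly batches to continuous processing changes how events are delivered, not how they are handled.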

Hadoop and the Problem of Data Gravity

We noted that two factors brought disruption to the IT environment in recent years: the growth of data and the acceleration of data flows. Hadoop gained its popularity because it provided an economical way to manage the first of these two factors. But, on its own, it only handles half the problem. The reality is that data has gravity in the sense that if you accumulate all the data in a single place, then moving it starts to become prohibitively expensive. You become obliged to move the processing (the applications that wish to use the data) to the data.

This eventually forces centralization and puts an awkward constraint on flexibility. Ultimately, if you build an ever-expanding data lake, applications will drown in it. Software architecture requires being able to distribute data and processing flexibly across available resources. In our view, the event-driven world that is gradually emerging requires both an ability to scale up the processing of data and an ability to manage data flows. Both need to be intelligently catered for.

MapR 5.1 and MapR Streams

The MapR Converged Data Platform was built with the idea of data movement in mind. For a variety of reasons, MapR chose not to use HDFS directly, instead preferring to develop its own file system, MapR-FS, which is POSIX compliant, supports the HDFS API and is better suited to the full range of application workloads, not just large batch jobs. It improves performance for most workloads and delivers better overall data security. It also provides a sound foundation for implementing data distribution.

MapR’s goal was to establish a coherent global file system that could distribute data and applications across multiple Hadoop clusters, both locally and remotely, between data centers and into the cloud. It never subscribed to the idea that organizations should be limited to a large, central data lake. A key component needed to realize this vision was a real-time data and messaging transport system. This is what MapR Streams, released with MapR 5.1, delivers.

Figure 1 illustrates how the MapR Streams publish/subscribe capability works. As the left side of the diagram indicates, MapR Streams is one of three components that can be accessed directly by applications. The other two are MapR-DB and MapR-FS. An application, whether it is a normal business app, one of the bulk processing apps or a stream processing app, can interact with any one or all of these MapR components. Applications use MapR Streams either to send messages or data to other applications or to receive messages or data from them. There is no limit to the number of applications that can connect to MapR Streams.

As illustrated on the right side of the diagram, data producers (publishers) or data consumers (subscribers) connect to a specific data substream (a topic). One or more publishers send data to the topic, and it is transmitted at once (record-by-record from memory in real time) to all the consumers for that topic. Thus, the diagram shows Producers 1, 2 and 3 writing data to Topic 1, which is immediately sent to Consumers 1 and 2. Simultaneously, Producer 4 is writing data to Topic 2 that is being transmitted to Consumer 3. A grouping of topics constitutes a stream, which might be dedicated to a specific distributed application, business system or IT service.
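The producer, topic and consumer layout described for Figure 1 can be mimicked with a toy in-memory model. The sketch below is not the MapR Streams API: `ToyBroker`, `run_figure_1_example` and the message strings are hypothetical, and delivery is a plain function call rather than a network transport.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Message = str
Subscriber = Callable[[Message], None]


class ToyBroker:
    """In-memory publish/subscribe model; an illustration, not a real broker."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Subscriber]] = defaultdict(list)

    def subscribe(self, topic: str, subscriber: Subscriber) -> None:
        """Register a consumer callback for a topic."""
        self._subscribers[topic].append(subscriber)

    def publish(self, topic: str, message: Message) -> None:
        """Deliver a record to every subscriber of the topic as it arrives."""
        for subscriber in self._subscribers[topic]:
            subscriber(message)


def run_figure_1_example() -> Dict[str, List[Message]]:
    """Recreate the layout in the text: three producers on Topic 1, one on Topic 2."""
    broker = ToyBroker()
    inboxes: Dict[str, List[Message]] = {
        "consumer1": [],
        "consumer2": [],
        "consumer3": [],
    }
    broker.subscribe("topic1", inboxes["consumer1"].append)
    broker.subscribe("topic1", inboxes["consumer2"].append)
    broker.subscribe("topic2", inboxes["consumer3"].append)

    for producer in ("producer1", "producer2", "producer3"):
        broker.publish("topic1", f"{producer}: event")
    broker.publish("topic2", "producer4: event")
    return inboxes
```

In this model Consumers 1 and 2 each receive all three Topic 1 records, while Consumer 3 sees only the record from Producer 4.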

MapR Streams supports multiple streams, with a maximum of 100,000 topics per stream and no limit to the number of streams or to the number of messages/events within each stream. There is also no limit to the number of producers that can write to a given topic and no limit to the number of consumers that can receive data from a given topic. As such, you can think of MapR Streams as a multitenant, in-memory streaming capability with a capacity of billions of messages/events per second.

It is robust, guaranteeing message delivery and providing automatic management of disconnection/reconnection in the event, for example, of the failure of a communications line. Its security is part of a unified framework that also embraces the MapR-FS and the MapR-DB. It provides authentication, wire-level encryption and a granular level of authorization for producers, consumers and MapR Streams administrators. In practice, it requires little administration beyond the definition of topics and streams and the specification of service levels.

The Global Capabilities of MapR

With the addition of MapR Streams, MapR becomes a truly global data platform able to support any type of distributed workload, ranging from the bulk processing of Hadoop applications (MapReduce, Hive, HBase, etc.) to real-time stream processing using Spark, Storm or any other data streaming capability. In its prior release, MapR already delivered global data distribution capabilities via MapR-DB, but now its capabilities are more general, and a real-time data transport capability is embedded in the data platform.

Figure 1. MapR Converged Data Platform. Left: business, bulk processing and stream processing applications running on MapR-FS, MapR-DB and MapR Streams over MapR Platform Services, a global platform distributable across local and remote clusters. Right: Producers 1, 2 and 3 writing to Topic 1, read by Consumers 1 and 2, and Producer 4 writing to Topic 2, read by Consumer 3.

Figure 2 illustrates two instances of the platform in Data Center 1, one instance in Data Center 2, and one instance in the cloud. First, consider the disaster recovery (DR) possibilities. Using MapR Streams, the cluster in Data Center 2 could be configured to mirror all the data held in the three other clusters and be brought into action in the event of a catastrophic failure in Data Center 1 or in the cloud. In fact, any of the MapR instances could be used in that way and kept current, in real time.

Data Center 1 depicts the distribution of local workloads. A Hadoop cluster can become overloaded by too many applications competing for the same data. Adding more servers to the cluster may fix the problem, but it is not guaranteed to do so. With the MapR platform, this situation can be resolved by configuring another cluster and using MapR Streams to keep the two clusters in sync. Alternatively, for workloads that work well together, the bulk processing applications perhaps could be confined to one cluster, while streaming applications could run on the other. MapR Streams can distribute the data that needs to be shared across the two clusters.

This ability to distribute workloads across Hadoop clusters within a data center is remarkably flexible and extremely useful. It enables true capacity planning and workload management, rather than giving way to the Hadoop cluster sprawl that is currently becoming quite common. It would, in fact, be relatively easy to transfer applications and their data from one cluster to another when desired, or even to have the same application available to run on either cluster.

Similarly, it would be possible to transfer applications and their data between data centers.

This is a kind of multi-master replication: a situation where data is shared between multiple instances or locations and can be updated from any one of them. This is a capability some databases provide, and MapR provides it for both streams and databases.
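To make the idea of multi-master replication concrete, here is a minimal Python sketch in which either of two sites accepts writes and forwards them to the other. Everything in it is an assumption for illustration: the `Replica` class is hypothetical, and the last-writer-wins rule based on a caller-supplied logical timestamp is a common textbook choice, not a description of how MapR resolves conflicting updates.

```python
from typing import Dict, List, Tuple

# A version is (logical timestamp, site name). Tuples compare left to right,
# so the site name only breaks ties between equal timestamps.
Version = Tuple[int, str]


class Replica:
    """One writable copy of the data (hypothetical model, not a MapR API)."""

    def __init__(self, site: str) -> None:
        self.site = site
        self.data: Dict[str, Tuple[Version, str]] = {}
        self.peers: List["Replica"] = []

    def apply(self, key: str, version: Version, value: str) -> None:
        """Last-writer-wins: keep the incoming value only if it is newer."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def write(self, key: str, value: str, timestamp: int) -> None:
        """Accept a local update, then forward it to every other replica."""
        version = (timestamp, self.site)
        self.apply(key, version, value)
        for peer in self.peers:
            peer.apply(key, version, value)

    def read(self, key: str) -> str:
        """Return the current value for a key."""
        return self.data[key][1]


def connect(*replicas: Replica) -> None:
    """Make every replica a peer of every other replica."""
    for replica in replicas:
        replica.peers = [r for r in replicas if r is not replica]
```

A write made at either site becomes visible at the other, and a stale update arriving late is ignored because its version is older than the one already stored.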

Figure 2. MapR Converged Data Platform: Real-Time Event Streaming. Two platform instances in Data Center 1, one in Data Center 2 and one in the cloud, each comprising MapR-FS, MapR-DB and MapR Streams with bulk and stream processing applications, linked by real-time data transport.


The Application Layer

Traditional IT architectures have provided a variety of components to cater for data flows. Aside from the ponderous ETL products, there were change data capture and database replication capabilities, message queues and ESBs, and streaming software, each catering for slightly different usage contexts and none providing a comprehensive solution. Lacking coherent data flow capabilities, Hadoop implementations often become silos. While that may be adequate for the workloads involved, it does not support the timely flow of data between Hadoop clusters or dependent applications in a cohesive way. Such siloed clusters are data sinks rather than components of an event-oriented architecture.

MapR resolves these two awkward issues by catering for stream processing, bulk processing, and everything in between and also by providing a versatile global distribution capability.

The primary use case for Hadoop is analytics in all its variety: time-series analytics, real-time predictive analytics, data mining and discovery, graph analytics, analytics on unstructured data and text, and so on. With the current release of MapR, an additional use case is added to this: the maintenance of the system of record.

Bearing the capabilities of the MapR Platform in mind (security, versatile file system, performance, real-time communications), what is emerging can be thought of as a global operating system for data. Impressive though it is in its own right, the MapR Platform needs also to be considered as a data service platform for both distributed and local applications. The multitude of complementary Hadoop components (open source components such as Hive, HBase, Drill, Spark, Storm, etc., as well as commercial components) provide an extremely fertile application development and execution environment.

This is particularly the case for the analytics and BI applications that currently dominate the Hadoop landscape. In terms of latency requirements, time series and predictive analytics are generally streaming applications that need very low latencies. Some BI applications (alerts, dashboards, performance management) are low latency, while others are less demanding. Data discovery and exploration and big data analytics (on structured data, unstructured data, graph data, text and so on) are averse to long wait times and hence need scale-out parallelism, but don’t demand very low latency.

Precise requirements vary from business to business, of course, but it is clear that both this range of applications and the system of record will be best served by a data environment that implements a streaming architecture (or event-driven architecture) able to cater for both bulk processing and real-time analytics.

MapR 5.1: A Customer Story

The new release of MapR has already been deployed in beta sites, including businesses in the health sector and financial sector. The health care implementation was particularly interesting because it required a global capability, as the company involved has data centers in the US and in the European Union. Because of this, there were healthcare compliance requirements involved: the exacting HIPAA regulations in the US and in-country storage regulations in the EU, where personal health details needed to be stored in the country of origin. The company wished to build a system of record, but because of compliance regulations, it would have to be distributed. This is a requirement to which MapR is particularly well suited.

The overall objective was to build a global, flexible, compliant healthcare database. To add to the complexity of the project, there were a variety of users (data consumers) using a variety of applications, which meant that different data models needed to be catered for. Additionally, a search capability was required, but, because of compliance constraints, not all data could be searched.

An overview of the solution is illustrated in Figure 3. Applications local to the data center would generally access local data, held on file, in the Graph DB or in MapR-DB. In this way, all required data structures were catered for. Because of compliance restrictions, only some data was replicated between data centers. For specific requirements, materialized views were continuously computed in MapR-DB or the Graph DB for use by Elasticsearch. This was done for the sake of performance.
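The rule that only some data may be replicated between data centers can be pictured as a filter applied before records leave a site. The sketch below is purely illustrative: the `restricted` flag and both function names are hypothetical, and the paper does not describe how the customer actually classified records.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def may_leave_data_center(record: Record) -> bool:
    """Hypothetical rule: a record replicates only if explicitly unrestricted."""
    # Records with no flag are treated as restricted, so the filter fails closed.
    return not record.get("restricted", True)


def select_for_replication(records: Iterable[Record]) -> List[Record]:
    """Return the subset of local records that may be sent to the other site."""
    return [record for record in records if may_leave_data_center(record)]
```

Treating unflagged records as restricted is a deliberate design choice in this sketch: a missing classification keeps data in its country of origin rather than letting it cross a border by default.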

The creation of the system of record was achieved as specified: secure, immutable, rewindable, and auditable.
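One way to picture an immutable, rewindable and auditable record is an append-only log that can be replayed from any point. The following Python sketch is a toy model under our own assumptions: `EventLog` and its methods are hypothetical names, security is not modelled at all, and nothing here reflects the internals of MapR Streams.

```python
from typing import Any, Dict, Iterator, List, Tuple


class EventLog:
    """Append-only log: records are never changed or removed once written."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, Any]] = []

    def append(self, key: str, value: Any) -> int:
        """Store a record and return its offset in the log."""
        self._records.append((key, value))
        return len(self._records) - 1

    def replay(self, from_offset: int = 0) -> Iterator[Tuple[int, str, Any]]:
        """Rewind to an offset and yield every record from there onward."""
        for offset in range(from_offset, len(self._records)):
            key, value = self._records[offset]
            yield offset, key, value

    def state_at(self, offset: int) -> Dict[str, Any]:
        """Rebuild the key/value state as it stood after the given offset."""
        state: Dict[str, Any] = {}
        for key, value in self._records[: offset + 1]:
            state[key] = value
        return state
```

Because updates are appended rather than overwritten, the full history remains available for audit, and the state at any earlier offset can be reconstructed by replaying the log up to that point.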

In Summary

MapR’s vision is quite distinct from that of other distributors. It delivers a different architectural approach to Hadoop while shipping and supporting all open source projects in their entirety. As far as we are aware, it is the only distribution that offers a truly global capability that supports everything from real-time analytics to bulk processing. More than a data platform, it is fast becoming an operating system for data and a global system of record. For companies that are currently planning to implement Hadoop at a corporate level, we advise taking a close look at MapR.

Figure 3. Global Database (SOR) Implemented on MapR. The US and EU data centers each contain applications, data input/updates, records in MapR Streams, MapR-DB (JSON), a graph DB (Titan on MapR-DB) and a search engine (Elasticsearch).

About The Bloor Group

The Bloor Group is a consulting, research and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both www.TheBloorGroup.com and www.InsideAnalysis.com for more information.

The Bloor Group is the sole copyright holder of this publication.

Austin, TX 78720 | 512-524-3689
