
Ross Mason, John D’Emic, Ken Yagen, Dan Diephouse, Alex Li

Big Data’s Velocity and Variety Challenges


Table of Contents

1 Introduction

2 Emerging Architecture for High-Velocity Big Data

Trend towards Real-Time

Real-Time Data Capture at the Edge of the Network

Anypoint Platform & Real-time Data Ingestion

Application Analytics – time-sensitive and low-latency data

Capture Exhaust Data – low-value data

Anypoint Platform & Real-Time Processing

Apache Kafka, Storm and Spark

Anypoint Platform and Big Data Streaming

3 Emerging Architecture for High-Variety Big Data

Storing Unstructured Data and NoSQL Databases

Select Anypoint Platform Use Case with NoSQL - Document Storage

Anypoint Platform’s strengths for Integrating NoSQL Databases

APIs for High-Variety Big Data

API Use Case with NoSQL

4 Conclusion and MuleSoft’s Big Data Integration Solution


1 Introduction

In 2001, Doug Laney described big data as cost-effective and innovative processing of high-volume, high-velocity and high-variety information for insights and decisions. More than a decade later, the Hadoop and NoSQL ecosystems have flourished and been adopted by an increasing number of enterprises. According to a report commissioned by Dell, 96% of midmarket companies are embracing the rise of big data. Kafka, Spark, Storm and many other innovations, in addition to Hadoop and NoSQL databases, are providing us with more pieces to solve the puzzle. However, organizations are still wrestling with two of the three (in)famous Vs.

Velocity: Real-time business intelligence (BI) and real-time machine learning applications are essential for analytics-driven organizations. Meaningful insights and decisions at the right time and place require both low-latency data ingestion and high-speed stream processing. Traditional data integration / ETL tools and Hadoop’s inherent batch-processing model are intrinsically incompatible with real-time big data applications.

Variety: Mixing and matching unstructured data from disparate sources and connecting multiple NoSQL and relational databases can be extremely complex. In a recent Gartner report, a third of the companies surveyed named “obtaining the skills and capabilities needed” and “integrating multiple data sources” as two of their top four challenges. Learning new query interfaces and hand-coding point-to-point data integration can waste precious IT resources and slow development.

At the same time, companies are embracing integration frameworks and APIs to modernize their legacy IT systems, addressing two similar issues – process automation by switching to event-based flows, and integration of various applications and data sources on premises and in the cloud. As big data evolves to become an integral part of the enterprise infrastructure, a unified architecture and development solution is emerging. MuleSoft’s Anypoint Platform plays a crucial role in helping CIOs, architects and developers solve the velocity and variety challenges around big data.


2 Emerging Architecture for High-Velocity Big Data

2.1 Trend towards Real-Time

Two trends are driving the increasing adoption of real-time architecture.

1. Centralized data processing drives real-time ingestion and the shift from ETL to ELT

We are increasingly shifting the heavy lifting of data transformation and cleansing to Hadoop or other clusters, because we are generating and collecting data at ever higher speeds. The traditional extract-transform-load therefore looks more like extract and load in real time, followed by massively parallel processing of the data. Particularly for big data, while some use cases may leverage batch capabilities, high-latency handling (outside the big data infrastructure) significantly reduces value and increases costs in many emerging high-impact use cases. Realizing a data hub or data lake often requires real-time capabilities.

2. Time value of data and reactive programming drive real-time processing and presentation of data

For end users of data, real-time is the ability to react to something as soon as it happens. In recent years, reactive applications have been an emerging trend gathering great fanfare, and being “event-driven” is one important pillar of the emerging paradigm. Typically, when we want to achieve real or near-real time, we design architectures that can respond to data as it arrives, without necessarily persisting it to a database. Being real-time is not only a necessary and inevitable step towards making machines behave, think and decide more like humans. It is also driven by users’ psychological needs and real-world economic implications. People hate waiting, and a query that takes minutes or hours to respond is unbearable. Delays make information less accurate and decisions less relevant.

Below is a summary of the rationale for adopting real-time big data architectures throughout the data cycle of capture, ingestion, processing and presentation.

• Capture: Low-latency, high-throughput data requires real-time data capture and processing. For example, IoT devices generate large amounts of machine data that require continuous handling at the edge.


• Ingestion: High-latency data ingestion into Hadoop and other NoSQL databases incurs unnecessary costs. First, staging and conducting ETL on a large batch of data occupies more expensive memory and processing inside traditional databases (compared to Hadoop). Second, managing hundreds of batch ETL and data quality processes across various data structures and sources can be costly for your IT organization. This is also what drives companies to shift from ETL to ELT, with transformations inside Hadoop.

• Machine Learning / CEP: The value and impact of machine learning applications rely on frequent refreshes of rules / models and low-latency ingestion of data. Real-time machine learning applications are about detecting fraud as credit cards are swiped, inferring operational anomalies when or before they happen, and rearranging ads as a user clicks through a website.

• BI / Analytics: A modern, globally distributed workforce requires 24/7 real-time query access to BI / analytics based on up-to-date data.

More and more, big data demands low-latency architectural solutions. The choice between the two paradigms for data processing, batch and stream, determines the latency of applications. Processing a batch of terabytes of data takes time and is fundamentally high-latency. Stream processing conducts computation on fragments of data as they arrive. The Anypoint Platform supports real-time capture, ingestion and processing of big data in this emerging category of architectures.

2.2 Real-Time Data Capture at the Edge of the Network

One of the emerging sources of big data is smart devices. Projections from Cisco and many other organizations depict explosive growth for the Internet of Things (“IoT”), which implies both great opportunities and challenges for big data integration and processing. Each smart device continuously generates a large amount of data. To realize value from IoT, the key is to form the “network” that captures, transforms and routes data generated by the embedded devices. MuleSoft’s Anypoint Platform already runs on premises and in the cloud, and can now run virtually everywhere thanks to Anypoint Edge. This initiative not only focuses on allowing Anypoint Platform to run on lightweight embedded architectures, like the Raspberry Pi, but also on supporting protocols and devices relevant to the embedded world, like MQTT and Zigbee. For example, embedding Anypoint Edge in routers can help capture requests and routing data in real time, enabling monitoring and big data analytics.


Anypoint Edge brings lightweight and crucial capabilities to embedded devices, such as routers, machines, and sensors.

• Edge supports and sends real-time data streams through lightweight wire protocols, such as MQTT

• Edge’s ability to map data across various structures, together with MuleSoft’s APIkit, enables you to build an API layer for embedded devices, making integration into the broader information flow easier and more flexible

• Edge can publish derivative events to other systems in the cloud or on premises and receive action events from these systems

• Edge can be used to monitor time-series events and take immediate actions based on rules

• Like MuleSoft’s other offerings, Anypoint Edge is easy to develop for and easy to deploy

With the capacity to run on limited hardware platforms and interact with the physical world through smart devices, Anypoint Edge, the first embedded integration platform, has all it takes to deliver in the domains of distribution, decoupling and smart routing, enabling capture and light transformation of real-time machine data for big data applications.
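As an illustration, the sketch below shows how telemetry published over MQTT might be captured and lightly transformed at the edge. It uses the Python paho-mqtt client; the broker host, topic layout and payload shape are illustrative assumptions, not part of Anypoint Edge itself.

```python
# Minimal sketch: capture sensor telemetry over MQTT at the edge and apply a
# light transformation before forwarding it into the broader data flow.
# Broker host, topic names and payload fields are hypothetical.
import json
import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    # Subscribe to telemetry from every device once connected.
    client.subscribe("sensors/+/telemetry")

def on_message(client, userdata, msg):
    reading = json.loads(msg.payload)
    event = {
        "device": msg.topic.split("/")[1],   # device id taken from the topic
        "value": reading.get("value"),
        "unit": reading.get("unit"),
    }
    # In a real deployment this event would be routed onwards (e.g. to a
    # message broker or an API); here we simply print it.
    print("captured edge event:", event)

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("edge-broker.local", 1883)     # hypothetical broker host
client.loop_forever()
```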

2.3 Anypoint Platform & Real-time Data Ingestion

An ETL-oriented data ingestion model means accumulating data in a staging area, conducting transformations and cleansing, and then loading the data into the final warehouse in one batch. This model is not ideal for many big data use cases. Before you decide whether to adopt real-time data ingestion, the following questions are worth considering:

• Is your big data time-sensitive, or low-latency and high-volume?
• Are you using an expensive database to stage low-value data before NoSQL / Hadoop?
• Are you using the data collected for real-time applications?

If your answer to any of them is yes, you should consider implementing an architecture that enables real-time data ingestion. Increasingly, big data use cases fall into those three scenarios. For example, social media data is usually time-sensitive. Web-log data is low-latency and high-volume. Machine data is usually relatively low-value, which would not justify any expensive processing. Today, many big data applications are real-time, e.g. fraud detection, customized ads, recommendation engines, predictive diagnostics, etc. As a result, organizations are increasingly collecting new forms of data in real time and depositing them directly into NoSQL stores, such as HDFS, Cassandra and MongoDB, following an ELT model.

MuleSoft’s Anypoint Platform, with both event-driven and batch processing capabilities, is a best fit to help enable the switch from batch ETL to real-time data ingestion. For example, e-commerce and other retail companies usually make constant updates to their massive online product catalogs with detailed information (SKUs, inventory, specs, reviews, etc.). A modern architecture can leverage Anypoint Platform to facilitate data flow from various sources into Cassandra, which can expose the most up-to-date product info to web / mobile apps and other applications through APIs, designed and managed by MuleSoft’s API Platform. Following are two common use patterns of MuleSoft and NoSQL databases for real-time data ingestion – collecting analytics data and exhaust data.

2.3.1 Application Analytics – time-sensitive and low-latency data

Analytics data for modern web, mobile and enterprise applications can be vast and unstructured, making it awkward to capture and operate on with a traditional relational database. Analytics data, once captured, can pilot the operation of your applications in real time. Insight into traffic patterns, for instance, can control the dynamic scaling of virtual machines hosting an application. Site appearance, including the choice and placement of ads, can be tailored for users based on how their behavior clusters them with other users of an application. NoSQL solutions such as MongoDB, Cassandra and Riak provide more natural storage mechanisms for such data, but getting the data into them can be tricky. Several capabilities of the Mule runtime, along with connectors for all major NoSQL stores, including MongoDB, Redis, Cassandra and Neo4j, help gather this analytics data:

• Application metrics typically must be produced and gathered asynchronously. This is often accomplished using a messaging fabric like AMQP, WebSockets or JMS. Mule’s event-driven architecture and support for almost every messaging protocol make it a natural choice to persist analytics data published on such a layer.

• MuleSoft’s asynchronous processing scope, notification system, Custom Business Event and wiretap features make it easy to capture analytics data for MuleSoft applications, which can then be inserted into Redis or Cassandra.
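A minimal sketch of this pattern, outside of Mule itself, might consume analytics events from an AMQP queue and persist them to MongoDB. It uses the pika and pymongo Python clients; the queue name, database name and event shape are illustrative assumptions.

```python
# Minimal sketch: analytics events published to an AMQP queue are consumed
# asynchronously and persisted to MongoDB. Queue name, database name and
# event shape are hypothetical.
import json
import pika
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["analytics"]["events"]

def handle_event(channel, method, properties, body):
    events.insert_one(json.loads(body))      # persist the raw analytics event
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="app.analytics", durable=True)
channel.basic_consume(queue="app.analytics", on_message_callback=handle_event)
channel.start_consuming()
```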


2.3.2 Capture Exhaust Data – low-value data

The plummeting cost of disk storage is making it feasible, and often desirable, to adopt a “store everything” mentality. Many companies are capturing all data, since information that seems useless today might be priceless in the future. Organizations can’t afford to throw away data that could be mined later on, but they can afford to store it. MuleSoft’s position in many enterprise architectures, typically as a mediation or proxy layer, makes it a natural place to capture this data. It can be challenging, however, to store this data, as message formats and protocols differ widely between endpoints and flows. Schema-less NoSQL solutions like MongoDB and Cassandra are ideal candidates to store unstructured data such as messages being proxied by Anypoint Platform. This is simplified by:

• MuleSoft’s notification system, asynchronous messaging scope and wiretapping features, which facilitate asynchronous message inserts into MongoDB and Cassandra using the Cassandra and MongoDB connectors.
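To make the pattern concrete, the sketch below stores proxied messages as-is in a Cassandra table using the Python cassandra-driver. The keyspace, table and column names are illustrative assumptions.

```python
# Minimal sketch: write "exhaust" messages into a Cassandra table so they can
# be mined later. Keyspace, table and column names are hypothetical, and the
# table is assumed to already exist.
import datetime
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("exhaust")      # assumes an existing "exhaust" keyspace

insert = session.prepare(
    "INSERT INTO raw_messages (id, received_at, source, payload) VALUES (?, ?, ?, ?)"
)

def store_message(source: str, payload: str) -> None:
    # Store every proxied message verbatim; value may only emerge later.
    session.execute(insert, (uuid.uuid4(), datetime.datetime.utcnow(), source, payload))

store_message("orders-api", '{"order_id": 42, "status": "created"}')
```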


Overall, if you are storing and plan to leverage time-sensitive, low-latency or low-value data, you should consider a real-time architecture enabled by MuleSoft, to achieve high returns and short time to market on your big data projects. Penn Mutual, one of MuleSoft’s customers, adopted MuleSoft and Cassandra. Switching from legacy batch mode to low-latency data ingestion enabled Penn Mutual to supply timely data to external partners.

2.4 Anypoint Platform & Real-Time Processing

Real-time stream processing is the inevitable next stage of big data evolution towards value realization through real-time decisions and actions. Instead of processing server log files hourly, companies want to know what just went wrong NOW. Instead of daily customer preference reports from website clickstreams, companies want to deploy customized ads and deals to customers while they are visiting the site NOW. Other use cases include trading stocks on large-volume real-time data streams, network monitoring, fraud and security breach detection, smart grid and energy conservation, etc. However, even though real-time stream processing has been put to great use at Google, LinkedIn and Twitter, how to implement an easy stream processing solution at other companies, given skill and resource constraints, is still an ongoing debate.

ETL and Hadoop are intrinsically batch-oriented and are unfortunately not the solution for real-time big data processing. The strength of Hadoop lies in its parallel storage and processing, which leverages MapReduce and replication to simplify analyses on very large data sets. However, each MapReduce batch job requires slow disk I/O, and replication and serialization also contribute to processing latency. As a result, ad-hoc queries are slow and stream processing is hard. ETL tools are in general not designed to be “always on”. For each batch, ETL tools set up connections and parallel processes, load the data into memory, and conduct data quality checks and transformations. There is usually a start and an end to each ETL job. Real-time processing is therefore the immediate need in many practical big data applications. In recent years, we have seen an emerging universe of new solutions, like Amazon Kinesis, Storm, Cloudera’s Impala, Spark and Tez, rising to address the issue. Kafka, Storm and Spark are three open source projects that can play significant roles in real-time analytics and machine learning applications.

2.4.1 Apache Kafka, Storm and Spark

Recently, we have seen an emerging real-time big data processing architecture using Kafka and Storm / Spark. Kafka and other message brokers can serve as real-time data aggregators and log service providers, offering replay and broadcast capabilities. Storm and Spark offer processing capabilities for large-volume stream data. The missing piece is how to integrate data flow into the messaging system and incorporate the results of stream processing into the broader information flow that powers applications and analytics. ESBs and APIs are the architectural tools that fill the gap. MuleSoft’s event-based Anypoint Platform, used for data ingestion and API publishing, together with an appropriate architecture (e.g. Mule-Kafka-Storm), can make real-time streaming applications easier to implement and future-proof.


Apache Kafka  

Kafka is an open-source distributed, partitioned, replicated commit log service. It is used as a high-performance message broker for real-time data applications. Producers publish messages to Kafka in different categories (“topics”). Messages flow into partitions (for scalability of Kafka clusters) and are appended to ordered, immutable sequences – commit logs. All published messages are retained, consumed or not, for a configurable period of time. Consumers subscribe to topics and process the feed of published messages.


Traditionally, messaging systems fall into two models – queuing and publish-subscribe. In a queue, each message goes to exactly one consumer, whereas pub-sub broadcasts messages to all consumers. Kafka can support both: “consumer groups” can be defined on Kafka, and each message is delivered to only one process in any consumer group, achieving queuing semantics if all processes are in the same group and pub-sub semantics if each process is in its own group. In traditional messaging use cases, Kafka is compared to ActiveMQ or RabbitMQ. In emerging big data use cases, Kafka is gaining popularity for real-time data aggregation and pub-sub feeding. For example, website activities such as views, searches and other interactions are tracked and published in real time to Kafka, with one topic per activity type. The feeds can then be subscribed to and applied in real-time monitoring and processing, and loaded into Hadoop or NoSQL stores for offline analyses. This is a typical high-volume, low-latency streaming application. Operational metrics monitoring and processing is another example. LinkedIn developed Kafka initially, and has used it extensively for activity stream data and operational data processing, powering products like the LinkedIn newsfeed and offline analytics. Airbnb uses Kafka for its event pipeline and exception tracking. Other users include Square (metrics, logs, events), Box (production analytics pipeline and monitoring), Tumblr, Spotify, Foursquare, Coursera, Hotels.com, and many others.
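A minimal sketch of these producer and consumer-group semantics, using the kafka-python client (broker address, topic name and event fields are illustrative assumptions):

```python
# Minimal sketch: publish page-view events to a Kafka topic and consume them
# as part of a consumer group. Broker, topic and event fields are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Publish a page-view event to the "page-views" topic.
producer.send("page-views", {"user": "u-123", "path": "/products/42"})
producer.flush()

# Consumers in the same group share the topic's partitions (queue semantics);
# consumers in different groups each receive every message (pub-sub semantics).
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="realtime-monitoring",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
```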

Apache Storm

Storm is an open-source distributed real-time computation system for processing high-velocity, large streams of data. It is designed to integrate with existing queuing and database systems, and it processes unbounded sequences of tuples (“streams”).

A Storm topology is a graph of computation, with each node containing processing logic and links defining data flow. The sources of streams are called “spouts” and the processing nodes are called “bolts”. A Storm cluster is structured very similarly to a Hadoop cluster, although adapted for real-time processing. While MapReduce jobs on Hadoop are batch-oriented and eventually finish, Storm topologies process messages forever until they are killed. Frameworks are available for integration with JMS and RabbitMQ / AMQP, which are well supported by Anypoint Platform. Storm has been adopted by Twitter, the Weather Channel, WebMD, Spotify, Groupon, RocketFuel and many others for various real-time applications. For example, at Yahoo!, Storm powers stream / micro-batch processing of user events, content feeds and application logs. At Spotify, Storm powers music recommendations, monitoring, analytics and targeting.


Apache Spark

Spark is an open-source cluster-computing environment similar to Hadoop, but with differences that make it superior for certain use cases. Spark’s in-memory cluster computing enables caching of data to reduce access latency. Instead of following Hadoop’s strict I/O – map – reduce – I/O chaining cycle, you can perform map or reduce steps in any order without the costly I/O in between. This significantly speeds up iterative processing jobs and interactive queries. Spark applications are also easier to write, with Java, Scala and Python as the three programming language options. Spark also powers a stack of high-level tools for machine learning (MLlib), graph data (GraphX), SQL (Spark SQL) and streaming (Spark Streaming). Spark is increasingly becoming the preferred processing engine in the Hadoop ecosystem. It can run on YARN and read data from HDFS, HBase and Cassandra. Spark’s in-memory design and its integration with Hadoop are making it the potential engine that unifies batch and real-time processing. Although it is still in the early stages of adoption and may lack some enterprise-level support and polish, it is quickly gaining momentum. Groupon, Autodesk, Alibaba Taobao, Baidu and many others are all users of Spark.
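As a simple illustration of stream processing in this style, the sketch below counts page-view events per URL over 10-second micro-batches with Spark Streaming (the classic DStream API). The socket source and comma-separated event format are illustrative assumptions.

```python
# Minimal sketch: count page views per URL in 10-second micro-batches using
# Spark Streaming. Events are assumed to arrive as "url,user_id" lines on a
# local socket; the source and format are hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="PageViewCounts")
ssc = StreamingContext(sc, 10)                    # 10-second batch interval

lines = ssc.socketTextStream("localhost", 9999)   # hypothetical event source
counts = (lines
          .map(lambda line: (line.split(",")[0], 1))   # key by URL
          .reduceByKey(lambda a, b: a + b))            # count views per URL

counts.pprint()                                   # print each micro-batch's counts

ssc.start()
ssc.awaitTermination()
```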


2.4.2 Anypoint Platform and Big Data Streaming

A few characteristics of Anypoint Platform make it an ideal lightweight real-time data transformation and integration tool to facilitate data inflow and outflow around a Kafka-Storm / Spark architecture.


Best-in-class performance: It can process up to tens of thousands of messages per second (depending on data structure, transformation, hardware specs, etc.), capable of feeding streams of data into message brokers for aggregation. It also offers a stateless architecture, so it can scale horizontally on commodity hardware.

Broad transformation and connectivity functions: It provides easy connectivity and mapping among virtually all business data sources and services, datastores (traditional and NoSQL), messaging systems and data structures. For a list of connector offerings, please refer to the connector list in section 3.3 below.

All-in-one, architecture-level integration solution: Anypoint Platform, which includes the API platform and Anypoint Edge, supports real-time data capture, batch processing, transformation and integration throughout all data flows across your organization, on premises, in the cloud or on edge devices. You can use MuleSoft to bring all data streams into Kafka-Storm or another real-time data processing system.

Real-time monitoring of operational metrics is one example of how Anypoint Platform can help integrate data to power real-time big data applications. For thousands of organizations, it already facilitates data flow among operational data sources and stores such as CRM, ERP, HR, marketing, finance and web apps. These data flows can be extended and published to Kafka in real time, as well as deposited directly into traditional and NoSQL databases for mining and future usage. Kafka can feed the aggregated stream data into Storm or Spark for processing, such as combining, comparing, sliding-window analyses, or applying machine-learning models. The real-time results then flow through Anypoint Platform to monitoring tools or to automated decisions.


With Anypoint Edge facilitating the capture of real-time machine data, a Mule-Kafka-Storm architecture can provide a solution for real-time machine-data processing.

In summary, as organizations start to modernize their data architecture towards real-time applications, crucial changes must be adopted to capture, integrate, store and process low-latency and high-volume data. MuleSoft provides the platform and products to support your adoption of new big data technologies, and to seamlessly modernize your IT system for both small-data and big-data real-time integration throughout the data lifecycle.

• Real-time capture: Anypoint Edge and API Platform + embedded devices
• Real-time ingestion: Anypoint Platform (on-premises or cloud) + NoSQL databases
• Real-time processing: Anypoint Platform + messaging queue + stream processing engines

3 Emerging Architecture for High-Variety Big Data

A few years ago, we needed to decide what data to store before designing the system to store, cleanse, and manipulate the data into the right form. With the rise of big data, many companies have begun to store data of all types and structures, in the hope that the data will be of use in the future. The number of data sources has exploded – social media, logs, machine data, images, videos, texts, and documents. The ways of representing the data have also multiplied: XML, JSON, binary, graph, etc. Various storage solutions to accommodate these different types and structures were, naturally, introduced.


The high variety of NoSQL storage solutions has therefore added great complexity to the handling of high-variety big data. Integrating NoSQL databases presents companies with three great challenges:

• Selection of the right NoSQL database: what is appropriate for storing documents may be inefficient for storing machine data or log data

• Skill and knowledge gap: the internal IT team may not have the knowledge and expertise to build connections and queries for the various new types of NoSQL databases. Finding the right talent or training existing developers can be costly and time-consuming, increasing the uncertainty and cost associated with the big data project

• Chaos: the compound permutation of data sources, data stores and data applications means an exponential increase in the number of connections and paths needed to facilitate data flow. Point-to-point integration means inevitable chaos in the future, and will eventually cause issues. Point-to-point, hand-coded data store connectivity also makes monitoring of data flow and usage difficult, putting the company at a disadvantage in an environment where big data is enabling big services with an increasing number of data consumers internally and externally.

MuleSoft provides a solution to standardize data integration connections among NoSQL stores and sources. Connectors to the major NoSQL databases in all categories help your developers build integration flows, both batch and real-time, with drag and drop. MuleSoft NoSQL database connectors:

• Document: MongoDB
• Graph: Neo4j
• Column: Cassandra, Riak, HBase
• Key-Value: Redis, HDFS
• Cloud: Amazon Redshift (JDBC, general database connector)

3.1 Storing Unstructured Data and NoSQL Databases

What storage solution to choose for a big data use case is an important decision. Fundamental to this decision is an understanding of the CAP theorem. The CAP theorem states that a distributed system can simultaneously provide only two of the following:


• Consistency: All nodes in the system have visibility to all data at the same time.
• Availability: A client of the system will receive a response to every request.
• Partition tolerance: The system functions normally despite a failure in another part of the system.

Your choice of a NoSQL solution will be largely guided by which two of the three your usage pattern dictates. The following is a list of NoSQL connectors provided by MuleSoft and how they can help address certain types and structures of data.

MongoDB

MongoDB is an open source, document-driven database with the following features:

• Horizontal scaling via sharding
• Strongly consistent


• Single documents can be atomically updated
• Multi-datacenter replication supported
• High availability via automatic failover with short downtime
• Data is stored in BSON format, a binary form of JSON
• Indexes are supported
• Ad-hoc query language allows data retrieval

MongoDB offers Consistency and Partition tolerance. MongoDB is best suited for the Document Storage and Data Exhaust usage patterns.
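A minimal sketch of the Document Storage pattern with the pymongo client (connection string, database and fields are illustrative assumptions):

```python
# Minimal sketch: store and query semi-structured product documents in MongoDB.
# Connection string, database, collection and document fields are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["catalog"]["products"]

# Documents need not share a rigid schema; each can carry its own fields.
products.insert_one({
    "sku": "SKU-1001",
    "name": "Wireless Router",
    "specs": {"ports": 4, "wifi": "802.11ac"},
    "reviews": [{"user": "u-1", "rating": 5}],
})

# Ad-hoc query on a nested field.
for doc in products.find({"specs.wifi": "802.11ac"}):
    print(doc["sku"], doc["name"])
```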

Cassandra

Apache Cassandra is an open source, column-driven database, featuring the following:

• Automatically horizontally scalable
• Semi-flexible structured key-value store
• Column and column-family based structure
• Tunable consistency
• Multi-datacenter replication supported
• Master/master reads and writes
• Map/reduce is supported
• Ad-hoc query language allows data retrieval

Cassandra offers Availability and Partition tolerance. While Cassandra is standardizing towards a CQL-based Java driver, the plethora of current drivers and the column-based access and query model make it a good candidate for the API Exposure usage pattern. Cassandra’s excellent multi-datacenter replication features make it well suited for the Multi-Data Center State Synchronization usage pattern.
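The tunable consistency mentioned above can be expressed per query with the Python cassandra-driver, as in this sketch (keyspace, table and columns are illustrative assumptions):

```python
# Minimal sketch: read product rows from Cassandra at QUORUM consistency,
# illustrating tunable consistency. Keyspace, table and columns are hypothetical.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("catalog")      # assumes an existing keyspace

# Require a quorum of replicas to agree before returning the result.
query = SimpleStatement(
    "SELECT sku, name, stock FROM products WHERE sku = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

for row in session.execute(query, ("SKU-1001",)):
    print(row.sku, row.name, row.stock)
```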

Hadoop Distributed Filesystem (HDFS)

The Hadoop Distributed File System, or HDFS, is a distributed file system primarily used to share data in a Hadoop cluster, featuring the following:

• Distributed file system
• Automatically horizontally scalable


• High availability via manual and automatic failover

HDFS offers Consistency. HDFS is typically used in conjunction with Hadoop for distributed map-reduce operations, making it a good choice for the Machine Learning usage patterns discussed above.

Riak

Riak is an open-source distributed implementation of the principles put forth in Amazon’s Dynamo paper. Riak features:

• Distributed file system
• Strongly consistent
• Automatically horizontally scalable
• High availability via manual failover
• Map-reduce is supported

Riak offers Availability and Partition tolerance. These characteristics make it an option for Multi-Data Center State Synchronization.

Redis

Redis is an open source key-value based “data structure server.” It provides mechanisms to store strings, hashes, lists, sets and sorted sets. Prominent features include:

• Key-value store with rich data structures as values
• Transactions simulated by executing multiple commands atomically
• Uses master/slave replication
• High-availability cluster in beta
• Data operations done via ad hoc commands
• Supports a basic pub/sub mechanism

Redis offers Consistency and Partition tolerance. Redis is a good option for Application Analytics.
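A minimal sketch of the Application Analytics pattern with the redis-py client, keeping per-page view counters in a hash and publishing each event on a pub/sub channel (key and channel names are illustrative assumptions):

```python
# Minimal sketch: track page-view analytics in Redis. A hash holds per-page
# counters and each raw event is also published on a pub/sub channel.
# Key names, channel name and event fields are hypothetical.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def record_page_view(page: str, user_id: str) -> None:
    # Increment the counter for this page inside a single hash.
    r.hincrby("analytics:page_views", page, 1)
    # Broadcast the raw event to any real-time subscribers.
    r.publish("analytics:events", json.dumps({"page": page, "user": user_id}))

record_page_view("/products/42", "u-123")
print(r.hgetall("analytics:page_views"))
```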

Neo4j

Neo4j is an open source, graph database with the following characteristics:


• Graph-oriented database
• Nodes and relationships support key/value properties
• Fully ACID
• Can easily be embedded
• High-availability clustering is available
• HA uses master/slave replication, with writable slaves
• Online backup is available
• Indexes are supported
• Ad-hoc query language allows data retrieval
• Graph traversal is supported
• Algorithm-based traversal is supported

Neo4j offers Consistency and Availability. Neo4j is a good choice for Application Analytics that can be modeled as graphs.
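A minimal sketch of graph-modeled analytics with the official Python neo4j driver and a Cypher query (connection details, labels and properties are illustrative assumptions):

```python
# Minimal sketch: record "user viewed page" relationships in Neo4j and query
# them with Cypher. URI, credentials, labels and properties are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def record_view(tx, user_id, page):
    tx.run(
        "MERGE (u:User {id: $user_id}) "
        "MERGE (p:Page {url: $page}) "
        "MERGE (u)-[:VIEWED]->(p)",
        user_id=user_id, page=page,
    )

def pages_viewed_by(tx, user_id):
    result = tx.run(
        "MATCH (:User {id: $user_id})-[:VIEWED]->(p:Page) RETURN p.url AS url",
        user_id=user_id,
    )
    return [record["url"] for record in result]

with driver.session() as session:
    session.execute_write(record_view, "u-123", "/products/42")
    print(session.execute_read(pages_viewed_by, "u-123"))

driver.close()
```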

3.2 Select Anypoint Platform Use Case with NoSQL - Document Storage

Document-driven databases like MongoDB are emerging as the preferred data backend for web and mobile applications. The rigid schemas associated with relational databases often make them ill-suited to storing semi-structured content such as product data, user-generated content like comments, or metadata attached to media files. Scaling relational databases horizontally has also proved challenging, pushing developers towards new solutions that easily support scaling across commodity hardware and data centers. Anypoint Platform’s support for document-oriented NoSQL databases like MongoDB, along with its lightweight transformation and data streaming features, simplifies the implementation of a NoSQL datastore. Some common usage patterns with Anypoint Platform and document stores include the following:

• Batch ingest and transformation of aggregate data accumulated periodically from external sources. A retailing use case is to use the MongoDB connector to persist product, stock and tax data into MongoDB collections by periodically polling web service APIs and data on FTP sites. Anypoint DataMapper can be used to transform the disparate formats of these sources into a canonical JSON format, which can be persisted directly into MongoDB (see the sketch after this list).

• Normalized databases, for good or ill, often long outlive the applications built on top of them. Document stores can be used to build a denormalized layer “on top” of these databases, providing a scalable fabric to facilitate ancillary big data use cases. Platform support for JDBC makes it easy to maintain a denormalized storage view in parallel to a normalized relational database.
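The transformation half of the first pattern might look like the following sketch in plain Python: disparate source records are mapped into one canonical JSON shape and bulk-inserted into MongoDB (source formats, field names and the collection are illustrative assumptions, not DataMapper itself).

```python
# Minimal sketch: map records from two differently shaped sources into one
# canonical JSON document format and bulk-insert them into MongoDB.
# Source shapes, field names and the collection are hypothetical.
from pymongo import MongoClient

catalog = MongoClient("mongodb://localhost:27017")["retail"]["catalog"]

def from_web_service(record):
    # The web service exposes camelCase fields.
    return {"sku": record["skuId"], "price": record["unitPrice"], "stock": record["qtyOnHand"]}

def from_ftp_csv(row):
    # CSV rows arrive as lists: sku, price, stock.
    return {"sku": row[0], "price": float(row[1]), "stock": int(row[2])}

canonical = [
    from_web_service({"skuId": "SKU-1001", "unitPrice": 19.99, "qtyOnHand": 12}),
    from_ftp_csv(["SKU-1002", "4.50", "80"]),
]
catalog.insert_many(canonical)    # all documents now share the canonical shape
```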

3.3 Anypoint Platform’s strengths for Integrating NoSQL Databases

While other tools may also help connect to NoSQL databases, Anypoint Platform has a few strengths that differentiate it. In addition to its easy-to-use graphical interface, robust enterprise support, cloud / on-premises deployment options, and top-of-class performance, the platform’s ability to integrate all sources of data and its dual batch / real-time processing capabilities are worth mentioning.


Anypoint Platform is already used by thousands of enterprises as the backbone to integrate a wide variety of data sources, on premises and in the cloud. The list below shows a subset of connectors:

• Applications: Twitter, Facebook, Concur, Salesforce, Zuora, Box, Microsoft Dynamics CRM, Marketo, NetSuite, Google Contacts, Google Calendar, Google Tasks, Google Spreadsheet, Oracle Siebel, Quickbooks, Jira, SAP, Zendesk, Drupal, Service Source, FreshBooks, HubSpot, Yammer, Google Prediction, Documentum, SugarCRM, Metanga, Jobvite, PayPal

• Databases: Amazon S3, Amazon DynamoDB, HL7, HDFS, Oracle DB, DB2, PostgreSQL, MySQL, JDBC, OData, Riak, Microsoft SQL Server, MongoDB, Cassandra, Neo4j, Redis

• Others: AMQP, RabbitMQ, Excel/CSV, SSL, SSH, SMTP, ActiveMQ, File, POP3, WebLogic, AJAX, TCP, Jetty, SwiftMQ, JBoss, WebSphere MQ, FTP, HTTP/HTTPS, SFTP, Apple Push Notification Services, Amazon Notification Services

The Anypoint Platform offers both real-time and batch capabilities. While Anypoint Platform has always been great for handling event-driven, real-time data flows, it has also launched “batch processing”, recognizing the need for enterprises to do both. While real-time ingestion and processing are increasingly adopted, some data and processes are still better suited to batch mode. For example, master data may still reside in traditional databases and be moved in batches in big data use cases. Synchronization among different applications and databases may still be a batch-oriented operation. Anypoint Platform can process messages in batches: within an application, you can initiate a batch job, a block of code that splits messages into individual records, performs actions upon each record, then reports on the results and potentially pushes the processed output to other systems or queues.

3.4 APIs for High-Variety Big Data

While some NoSQL solutions, notably MongoDB, offer straightforward query and data representations using standard formats like JSON, others have more obtuse access methods or data representations. Interaction with Cassandra, until recently, required some insight into the Thrift protocol, on top of which Cassandra is built. This is further complicated by competing drivers for the datastores, even within a single platform. Redis, for instance, has six Java clients to choose from. Querying Neo4j graph databases requires knowledge of the Cypher query language.

An API layer in front of the datastore can ease access and adoption when working with these databases.

Anypoint Platform’s support for API construction can simplify adding an API layer in front of your data layer:

• Anypoint APIkit offers a declarative model to facilitate REST API design and development.

• Anypoint DataMapper can assist in the transformation from the persisted format of the data to the format being exposed by the API.

The fast-growing consumer electronics retailer Dick Smith replaced its legacy systems without business disruption, with help from Anypoint Platform’s integration and API capabilities. The retailer used Anypoint Platform to manage staged data integration from the AS/400, a 20+ year-old technology, to Amazon Redshift. APIs are being used to create interfaces for POS device communication. For more, please refer to this article.


3.4.1 API Use Case with NoSQL

A REST API sitting in front of Neo4j that accepts and returns JSON, for instance, would simplify access for developers writing an application on top of it who aren’t necessarily familiar with graph database concepts or querying with Cypher. Visibility into application interactions is often needed as more and more applications are built on top of APIs. Neo4j, a graph database, can be used to store a graph of dependencies between APIs. Coupled with usage and reliability data captured by Anypoint Service Data, this provides the ability to predict the impact of maintenance windows and upgrades. Such data would allow an organization to time a maintenance window, for instance, based on when, statistically, the fewest users will be affected.
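As a sketch of this idea, a thin JSON-over-HTTP layer could hide Cypher from API consumers. The example below uses Flask and the Python neo4j driver; the endpoint path, graph model and credentials are illustrative assumptions, not a description of Anypoint APIkit.

```python
# Minimal sketch: a REST endpoint in front of Neo4j that accepts and returns
# JSON, so consumers never need to write Cypher. Route, graph model and
# credentials are hypothetical.
from flask import Flask, jsonify
from neo4j import GraphDatabase

app = Flask(__name__)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

@app.route("/apis/<name>/dependencies")
def dependencies(name):
    # Translate the HTTP request into a Cypher query over the API dependency graph.
    with driver.session() as session:
        records = session.run(
            "MATCH (:Api {name: $name})-[:DEPENDS_ON]->(d:Api) RETURN d.name AS name",
            name=name,
        )
        return jsonify({"api": name, "depends_on": [r["name"] for r in records]})

if __name__ == "__main__":
    app.run(port=8080)
```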


4 Conclusion and MuleSoft’s Big Data Integration Solution

The next phase of big data evolution will be unifying big and small data, and realizing value from faster data and more sources. Addressing the velocity and variety challenges requires not only having the data and the infrastructure but, more importantly, integrating them under resource and skill constraints. Smart enterprises adapt their architectures and leverage integration tools, like MuleSoft, to shorten deployment cycles, save costs, avoid chaos and achieve ROI.

MuleSoft is uniquely positioned to offer the one-stop solution for your future-proof big data project implementation. The chart below summarizes MuleSoft’s product solution supporting the two big data challenges.

To tell us about your big data projects and discuss how MuleSoft could help you, please email us at: [email protected].