BIG DATA ORCHESTRATION
MAPREDUCE, HADOOP, ETL, ELT, INFRASTRUCTURE, CLOUD, WORKFLOW, META-SCHEDULER AND COST OPTIMISATION
WHITEPAPER | 2017


ActiveEon www.activeeon.com | try.activeeon.com | [email protected] | +1 408 645 5105 | +33 9 88 77 76 60

Executive Summary

As the Big Data ecosystem becomes more and more complex, with additional storage tools, analytical tools, BI tools, etc., a stronger need for a comprehensive Orchestrator is emerging.

The tools most commonly used for Data Management in the enterprise fall short when it comes to planning, orchestrating and piloting the many applications used for Data & Analytics. A comprehensive Orchestrator must handle Workflows, pilot and synchronize the simultaneous execution (Meta-Scheduling) of diverse tools, and provision and manage IT resources.

Such a comprehensive orchestrator provides a unified, central management platform for the full automation of all data-related processes. Adoption brings a strong ROI: Big Data processes are industrialized, execution and DevOps times shrink, manpower and IT resources are saved, and availability improves.

Contents

Introduction
Big Data Main Ecosystem
Expand Traditional Big Data Capability: add Workflows
Ensure High Utilization
Optimize Operational Costs with an Error Management System
Optimize Operational Costs with Single Pane Dashboard
Get Highly Critical Jobs Prioritized
Secure Your Data and Enforce Compliance
Ensure Data Integrity
Integrate with Any System with an Open Solution
Orchestration Benefits
ProActive
    Introduction to ProActive
    Role of ProActive in Big Data area
    ProActive in Practice
Conclusion


Introduction

In today's world, the adoption of Big Data is critical for the survival of most companies. Storing, processing and extracting value from data is becoming the main focus of IT departments. This huge amount of data, known as Big Data, has four properties: Volume, Variety, Value and Velocity. Systems such as Hadoop, Spark, Storm, etc. are the de facto building blocks for Big Data architectures (e.g. data lakes), but they fulfil only part of the requirements. Moreover, on top of this mix of features, which already represents a challenge for businesses, new opportunities will add even more complexity. Companies are now looking at integrating ever more sources of data, at breaking silos (variety is increasing with structured and unstructured data), and at real-time, actionable data. All of these are becoming key for decision makers.

Fig 1: From data to decisions

Multiple solutions on the market support Big Data strategies, but none of them fits every company's use cases. Consequently, each of these solutions will be responsible for extracting some of the meaning from the data. Although this mix of solutions adds complexity to infrastructure management, it also allows the full information to be extracted from the data. New questions are then raised: How do I break company silos? How do I make sense of this pool of unrelated and unstructured data? How do I leverage supervised and unsupervised machine learning? How do I synchronize and orchestrate the different Big Data solutions? How do I allocate relevant resources? How do I ensure critical reports get prioritized? How do I enforce data locality rules and control the spread of information? How do I monitor the whole data journey?

This paper explores the technical, operational and economic challenges around orchestration solutions. To leverage Big Data, companies will need to address them in order to optimize their infrastructure, extract faster and deeper insight from their data, and thus gain a competitive edge.

Big Data Main Ecosystem

Multiple systems such as Hadoop (incl. YARN) and Spark use the well-known MapReduce paradigm as their abstract computational model. The ecosystem of MapReduce and its derivatives is mature and very good at parallel processing of Big Data. But many of these systems are "offline" (batch) processing platforms and therefore cannot handle dynamic data streams.

To address this limitation, several stream processing engines have been proposed. Some of them are centralized and cannot process a huge volume of data streams; others are parallel. Among the parallel frameworks, Spark Streaming, Yahoo! S4 and Twitter Storm are the most widely used. Spark Streaming is not a true "real-time processing" framework because incoming events are cached and processed as mini-batches, which results in a larger delay than real streaming frameworks such as Twitter Storm. Yahoo! S4 does not provide a dynamic load balancing protocol, which means that nodes cannot be added or removed at runtime.

Managing the entire process with Hadoop alone is difficult: it relies on heavy interaction with the whole IT infrastructure, with multiple storage locations and with multiple processes running in parallel. In this context, ETL (Extract, Transform and Load) makes sense in the Big Data world to pre-compute, select and clean data, and to enable end users to easily perform analysis. The newer trend of ELT (Extract, Load and Transform), instead of transforming the data before it is written, leverages the target system to do the transformation: the data is copied to the target and then transformed in place. In both cases, ETL or ELT is in charge of creating additional views for end users (people or software), and requires Synchronization and Orchestration.
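To make the ELT idea concrete, the following minimal Python sketch loads raw data unchanged into the target database and then transforms it in place with SQL. The file name, table names and columns (events.csv, raw_events, clean_events) are hypothetical and only serve to illustrate the pattern.

    import csv
    import sqlite3

    # ELT sketch: load the raw data first, then transform in place on the target.
    # 'raw_events' and 'clean_events' are hypothetical table names.
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (ts TEXT, customer TEXT, amount TEXT)")

    # 1. Extract + Load: copy the source file into the target as-is (no transformation yet).
    with open("events.csv", newline="") as src:
        rows = [(r["ts"], r["customer"], r["amount"]) for r in csv.DictReader(src)]
    conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", rows)

    # 2. Transform: let the target system build the cleaned view in place.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS clean_events AS
        SELECT ts, customer, CAST(amount AS REAL) AS amount
        FROM raw_events
        WHERE amount IS NOT NULL AND amount != ''
    """)
    conn.commit()
    conn.close()

In an ETL variant, the cleaning and casting would instead happen before the INSERT, outside the target system.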

For a more technical presentation of the main Big Data frameworks, see the Big Data Landscape whitepaper.

Expand Traditional Big Data Capability: add Workflows

One of the main behaviors expected from a Big Data system is to move the data according to its usage. Traditionally, algorithms first process real-time data in memory before moving it to a datacenter managed by applications such as Hadoop. An ETL can then Transform and Load some of the data to be used later by BI (Business Intelligence) tools. Other information can be extracted through MapReduce. Finally, a subset can be fed to Machine Learning algorithms to extract deeper information.

Moreover, recent articles converge in showing a growing interest in adding data capabilities outside the Hadoop solution in Big Data architectures (e.g. a logical data warehouse accessible through APIs). Companies are pushing to break silos and extract more information from this data. The data used for analysis is now collected from multiple points, which adds requirements and calls for a higher-level tool to Organize, Synchronize and, above all, Automate. Companies have been using advanced workflow systems to support those needs. Each workflow automatically extracts, transforms, loads and synchronizes the data from many sources, using many tools and frameworks. Companies break silos and gain in flexibility and agility. Moreover, workflows enable such tasks to be performed sequentially or in parallel very efficiently.

For example, clear, graphical workflows (Fig. 2) enable users to visualize the data journey, synchronize tasks through dependencies, identify at a glance which information can be extracted, and ensure a good data flow. Moreover, managing the processes at such a high level also enables prioritization of critical processes, error management, resilience and high availability of data, and reports and analysis of the system.

Fig 2: A Data Processing Workflow with clear Dependencies, Synchronizations, and Parallel Executions
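As an illustration of such a workflow, the minimal Python sketch below models a data journey as a small DAG of tasks with explicit dependencies and runs independent tasks in parallel. The task names (ingest, transform, train_model, report) and the thread-based executor are assumptions chosen for the example; a real orchestrator provides this natively.

    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical task bodies; each would normally call Hadoop, an ETL, Spark, etc.
    def ingest():      print("ingest raw data")
    def transform():   print("ETL / ELT step")
    def train_model(): print("machine learning step")
    def report():      print("BI report")

    # Workflow as a DAG: task -> list of tasks it depends on.
    workflow = {
        ingest: [],
        transform: [ingest],
        train_model: [transform],
        report: [transform],          # report and train_model can run in parallel
    }

    def run(dag):
        done = set()
        with ThreadPoolExecutor() as pool:
            while len(done) < len(dag):
                # Tasks whose dependencies are all satisfied can start in this wave.
                ready = [t for t, deps in dag.items()
                         if t not in done and all(d in done for d in deps)]
                for future in [pool.submit(t) for t in ready]:
                    future.result()
                done.update(ready)

    run(workflow)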

Ensure High Utilization

On the path to leveraging Big Data, optimizing resource utilization is critical. This can be achieved by programming job dependencies, allocating resources to each task, and managing the overall resource pool globally.

As Adrian Cockcroft said while leading cloud computing at Netflix: "If you build applications that assume the machines are ephemeral and can be replaced in a few minutes or even seconds, then you end up building an application that is cost-aware". Cost savings can then be achieved with an orchestration tool that includes a resource manager. Such a solution is aware of future resource needs and organizes the pool of resources to match future demand. IT teams can configure smart behaviors within their system, such as cloud bursting to collect additional resources when required, resource failure management to reschedule tasks and try to recover failing resources, and resource allocation optimization to parallelize tasks on different resources (e.g. CPU and GPU) (Fig. 3).

Most solutions with an embedded resource manager, such as Hadoop, Spark and Storm, do not have a global vision of the infrastructure and benefit from a global scheduler, also called a meta-scheduler. Indeed, a meta-scheduler with a resource manager balances resources according to all running processes, ensuring higher utilization of the overall resource pool while meeting individual job deadlines.

A CPU consuming application, such as a risk calculation on portfolio exposure to events, can run in parallel with applications which are RAM intensive, such as merging different exposure risks. This allows for a more efficient use of a specific machine, with two or more tasks running in parallel.

Once a Hadoop job completes, the meta-scheduler can plan another job on the freed resources, depending on resource requirements, dependencies, priorities, returned results, etc.

Fig 3: Optimized scheduling according to resource utilization
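The following minimal sketch illustrates the idea of packing complementary tasks onto the same machine: a CPU-bound and a RAM-bound task can share one node as long as their combined demand fits its capacity. The node capacity, task names and numbers are hypothetical, and a real meta-scheduler uses far more sophisticated placement.

    # Hypothetical node capacity and task demands (cpu cores, ram in GB).
    node = {"cpu": 16, "ram": 64}
    tasks = [
        {"name": "risk_calculation", "cpu": 12, "ram": 8},    # CPU intensive
        {"name": "exposure_merge",   "cpu": 2,  "ram": 48},   # RAM intensive
        {"name": "etl_batch",        "cpu": 8,  "ram": 32},
    ]

    def schedule(node, tasks):
        """Greedy packing: place tasks on the node while capacity remains."""
        free = dict(node)
        placed = []
        for t in sorted(tasks, key=lambda t: t["cpu"] + t["ram"], reverse=True):
            if t["cpu"] <= free["cpu"] and t["ram"] <= free["ram"]:
                placed.append(t["name"])
                free["cpu"] -= t["cpu"]
                free["ram"] -= t["ram"]
        return placed, free

    placed, free = schedule(node, tasks)
    print(placed)   # ['exposure_merge', 'risk_calculation'] with these numbers
    print(free)     # {'cpu': 2, 'ram': 8}

The CPU-heavy and RAM-heavy tasks end up co-located on the node, which is exactly the complementarity described above.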

The boundary between an Orchestrator and a Meta-Scheduler is sometimes difficult to define. One can say that an Orchestrator needs the capacity to schedule Workflows, while some Meta-Schedulers might only execute simple Tasks without dependencies, and might also lack Resource Management.

The ideal situation is an Orchestration solution that can:

1. manage Workflows and Resources natively,
2. act as a Meta-Scheduler to delegate task execution to native tools when needed,
3. pilot resources to be given to each underlying solution (Fig. 4).


Optimize Operational Costs with an Error Management System

There are many reasons why a calculation can be lost in a resource-consuming workload: network or hardware issues, information that is unavailable for collection, etc. None of these errors is acceptable. Fortunately, with an appropriate solution, users can benefit, at the global architecture level, from auto-remediation, pause on error, human approval requests, etc.

Setting up error handling policies also helps organizations take action to fix errors and resume a workload. The main features relevant for a complete error management system at the Job level (an ordered list of tasks with dependencies, the executing instance of a Workflow) and at the Task level (the smallest unit of work, which can be a native task, a script, a Docker container, etc.) are listed below, followed by a small configuration sketch:

• Policy selection: cancel, pause, or continue execution,
• Automatic rescheduling of tasks in case of failure, on the same or a different node,
• Behavior control on task or resource failure,
• Offline fixing and task retry,
• Notifications.
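The sketch below shows how such a policy could be expressed and applied around a task execution. The policy names (CANCEL, PAUSE, CONTINUE), the retry count and the notification address are hypothetical and simply mirror the bullet points above; they do not reproduce any specific product's API.

    import time

    # Hypothetical error-handling policy attached to a task.
    policy = {
        "on_error": "PAUSE",      # CANCEL, PAUSE or CONTINUE the rest of the job
        "max_retries": 3,         # reschedule the task up to 3 times
        "retry_delay_s": 30,      # wait before retrying (e.g. on another node)
        "notify": ["ops@example.com"],   # placeholder address
    }

    def run_with_policy(task, policy):
        for attempt in range(1, policy["max_retries"] + 1):
            try:
                return task()
            except Exception as err:
                print(f"attempt {attempt} failed: {err}, notifying {policy['notify']}")
                time.sleep(policy["retry_delay_s"])
        # All retries exhausted: apply the job-level policy (cancel/pause/continue).
        print(f"task failed permanently, applying policy {policy['on_error']}")

    run_with_policy(lambda: print("collect data"), policy)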

Moreover, an error management system allows IT departments to optimize internal IT resources and to save on Clouds, for instance by using AWS Spot Instances and Google Cloud Preemptible VMs. These instances are significantly cheaper than regular on-demand resources, and their instability can be managed by the error handling system.

Optimize Operational Costs with Single Pane Dashboard

A unified platform for managing heterogeneous tasks and resources has multiple benefits. Local resources can be used in the same way as any public or private cloud, so companies can leverage each provider's strengths and be more flexible. The interface given to developers (e.g. a REST API) is standardized, which allows plugins and connectors to support more systems and helps the product evolve as tasks and resources change. Collected metrics are also normalized, allowing better comparison.

Furthermore, the main interfaces let operators check the status of each process, identify errors, extract logs and take immediate action, all at a glance. Information is quickly available, and maintenance is eased by the standardization of the diverse interfaces (e.g. logs are accessible in a unified way for any type of server or VM used).
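To illustrate what such a standardized interface enables, the short Python sketch below polls a hypothetical orchestrator REST API for job status and fetches the logs of failed jobs. The base URL, endpoints, JSON fields and token are assumptions for the example, not the actual API of any specific product.

    import requests  # third-party HTTP client

    BASE = "https://orchestrator.example.com/rest"   # hypothetical endpoint
    HEADERS = {"Authorization": "Bearer <token>"}    # placeholder credentials

    # List jobs and pull the logs of any job in error (field names are illustrative).
    jobs = requests.get(f"{BASE}/jobs", headers=HEADERS, timeout=10).json()
    for job in jobs:
        if job.get("status") == "ERROR":
            logs = requests.get(f"{BASE}/jobs/{job['id']}/logs",
                                headers=HEADERS, timeout=10).text
            print(f"job {job['id']} failed:\n{logs}")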

Get Highly Critical Jobs Prioritized

As stated previously, not all jobs have the same impact on the business or the same value over time. For instance, fraud detection is a real-time process: a hint of fraud triggers additional jobs to confirm it and blocks the transaction until it gets human approval. Another example can be found in risk management algorithms, which need to produce reports, alerts and opportunities that lead to immediate positioning on the stock market.

Custom scripts do not have an overall view of running processes and only support a basic priority system. A meta-scheduler allows running jobs to be "paused", saving their intermediate work, so that the preempted resources can be immediately allocated to higher-priority jobs. This priority feature is configured per user group to ensure optimum behavior based on company policies.
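The minimal sketch below illustrates the preemption idea: when a higher-priority job arrives and no slot is free, the lowest-priority running job is paused, its state saved, and its slot handed over. The job structure, slot count and checkpoint hook are hypothetical.

    class Job:
        def __init__(self, name, priority):
            self.name, self.priority = name, priority
            self.state = None                       # placeholder for saved intermediate work

    running = []                                    # jobs currently holding an execution slot
    SLOTS = 2                                       # hypothetical number of available slots

    def submit(job):
        if len(running) < SLOTS:
            running.append(job)
            print(f"running {job.name}")
            return
        # No free slot: preempt the lowest-priority running job if it is lower than ours.
        victim = min(running, key=lambda j: j.priority)
        if victim.priority < job.priority:
            victim.state = "checkpoint"             # save intermediate work before pausing
            running.remove(victim)
            running.append(job)
            print(f"paused {victim.name} to run {job.name}")

    submit(Job("nightly_report", priority=1))
    submit(Job("etl_batch", priority=2))
    submit(Job("fraud_detection", priority=9))      # preempts the nightly report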

Secure Your Data and Enforce Compliance

Nowadays, governments are putting more pressure on data sovereignty and data locality. Companies aiming to leverage Big Data are consequently challenged to secure their data and infrastructure. To support this need, orchestration tools include so-called "selection scripts" that select appropriate resources (nature, location, etc.) for each task. This ensures critical processes run in the right place and meet government and IT rules. For instance, as shown in Fig. 4, the orchestrator will run solutions such as Kibana, Elasticsearch, SAS, Tibco Spotfire, Anaconda, etc. only on specific environments, even though the resource pool is global.
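To give an idea of what a selection script does, here is a minimal Python sketch that accepts or rejects a node for a task based on its advertised properties. The property names (country, environment) are hypothetical, and real orchestrators typically express such scripts in their own scripting languages.

    # Hypothetical node properties advertised by the resource manager.
    node = {"country": "FR", "environment": "secured-zone", "has_gpu": False}

    # Hypothetical constraints attached to a task by the workflow designer.
    task_constraints = {"country": "FR", "environment": "secured-zone"}

    def select_node(node, constraints):
        """Return True if this node may execute the task (data locality / compliance)."""
        return all(node.get(key) == value for key, value in constraints.items())

    print(select_node(node, task_constraints))   # True: the task may run on this node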

Regarding data security, companies may deploy Layer 3 networks in their datacenters, protected by firewalls, to ensure higher security (e.g. Fig. 4). However, this creates new challenges when data must be sent to different places for analysis, since it creates subnetworks that may be isolated. Advanced orchestration tools have the ability to cross these barriers and ensure that the selected data flows appropriately without breaking any security rules. For instance, the result of an analysis in a secured environment can be transferred to a public database for BI tools to analyze.

Fig 4: Meta-scheduling of Big Data applications over secured environments


As shown in Fig. 4, a Meta-scheduler with a Resource Manager can orchestrate various applications in different environments and allocate appropriate resources to each one. The resource pool can then be shared across environments and additional savings can be made.

Ensure Data Integrity

In addition to data security, data integrity also represents a challenge for companies. Generally, a unique storage space stores all raw data to represent a single version of the "truth". This storage can be seen as Atomic, since it is used to build more complex views of the information; for instance, views such as POLE (People, Objects, Locations, Events) can be created from this Atomic data. Companies need to restrict access through an RBAC (Role-Based Access Control) system in order to carefully perform changes and maintenance on this Atomic database. This type of architecture, usually seen with Big Data, benefits from orchestration tools.

To ensure high availability, databases are usually replicated, and changes have to be made in parallel for consistency. A meta-scheduler with a resource manager enables this parallelization and monitors task progress. In case of failure, rollback processes can be triggered without corrupting the source.
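A minimal sketch of this pattern follows: the same change is applied to every replica in parallel, and if any replica fails, all of them are rolled back so that no copy diverges from the source. The replica identifiers and the apply/rollback hooks are hypothetical.

    from concurrent.futures import ThreadPoolExecutor

    replicas = ["db-eu-1", "db-eu-2", "db-us-1"]     # hypothetical replica identifiers

    def apply_change(replica, change):
        """Apply the change to one replica; raise an exception on failure."""
        print(f"applying '{change}' on {replica}")

    def rollback(replica, change):
        print(f"rolling back '{change}' on {replica}")

    def update_all(replicas, change):
        with ThreadPoolExecutor() as pool:
            futures = {pool.submit(apply_change, r, change): r for r in replicas}
            failed = [r for f, r in futures.items() if f.exception() is not None]
        if failed:
            # At least one replica failed: roll everything back to keep copies consistent.
            for r in replicas:
                rollback(r, change)
            raise RuntimeError(f"update failed on {failed}, all replicas rolled back")

    update_all(replicas, "add column risk_score")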

To securely perform changes and maintenance, standardization is key. A catalog of thoroughly tested processes is a requirement. Actions are then logged and controlled for any user meeting the relevant access rights.

Integrate with Any System with an Open Solution

Open Source solutions are exposed to public scrutiny, and a growing community enforces best practices. This is why plugins, connectors and new features are usually well designed, robust and often follow a Micro-Service Architecture. This type of architecture offers various advantages, such as superior testability, fine-grained scalability, and high maintainability.

Agile development, continuous integration, and continuous deployment are trending practices followed rigorously by open source development teams. These practices guarantee high-quality, client-centric products delivered regularly.


Fig 5: Comprehensive Open REST API for full integration with existing systems, third party software and services

Orchestration tools aim at integrating with many third-party solutions. Being open source exposes the different services to more use cases, allowing greater flexibility. Fig. 5 shows the range of main services accessible through a REST API, which allows full integration with existing platforms, business applications, and running services.
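As a sketch of this kind of integration, an existing business application could submit a workflow to the orchestrator through its REST API as below. The endpoint, payload fields, workflow name and response format are assumptions chosen to illustrate the integration pattern, not the actual API of any specific product.

    import requests  # third-party HTTP client

    BASE = "https://orchestrator.example.com/rest"   # hypothetical endpoint
    HEADERS = {"Authorization": "Bearer <token>"}    # placeholder credentials

    # Submit a workflow with per-run variables from a third-party application.
    payload = {
        "workflow": "nightly-risk-report",           # hypothetical workflow name
        "variables": {"date": "2017-06-30", "priority": "high"},
    }
    resp = requests.post(f"{BASE}/jobs", json=payload, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    print("submitted job", resp.json()["id"])        # job id returned by the orchestrator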

Finally, depending on the flexibility of the workflow and scheduling tool chosen, it will ease expansion, federation and integration projects, and ensure a better return on investment.

Orchestration Benefits

As presented in the previous sections, getting insight from a large amount of collected data is complex, and multiple parameters have to be taken into account. Various solutions, from Hadoop to Storm, offer ways to partially extract information along the data journey and to answer various challenges (volume, variety, value and velocity). New tools are also coming into this ecosystem to add metadata, tag and prepare data, offer SQL access, etc.

To organize all of these solutions and optimize parallelization, an orchestrator or meta-scheduler is required (Fig. 6). It will pilot diverse applications to ensure the process flow is respected, ensure the data follows government and company policies (e.g. data locality), handle error management and ensure full integration with other solutions such as BI tools, reports, etc. Moreover, to overcome the challenges faced by each individual solution, orchestrators enable secured data transfers, enable resource selection through firewalls, balance the overall load efficiently, etc.


Fig 6: Big Data simplified ecosystem and data journey

The next section briefly presents such a comprehensive Orchestration solution available in Open Source.

ProActive

Introduction to ProActive

ProActive Parallel Suite is an Open Source solution mainly supported by ActiveEon. ActiveEon specializes in IT automation and the digital transformation of scalable solutions. Its expertise includes cloud migration, IT transformation, Big Data and IoT. Its solution translates business processes into expressive computing workflows, and job parallelization and distribution accelerate business processes while reducing infrastructure costs.


Fig 7: Flexible orchestration tool for Big Data and standard solution over any infrastructure

Fig. 7 above shows that ProActive, at its core, features a comprehensive Workflow system, together with a native scheduler that can be used in standalone mode. Moreover, the solution includes resource management, as it is interfaced with native machines, virtual machines (VMware, OpenStack, CloudStack), and all the Public Clouds (Amazon AWS, Azure, Google GCP).

Role of ProActive in Big Data area

In this Big Data ecosystem, the ProActive Open Source solution fits into two main areas.

ProActive has a proven record in processing optimization (accelerating workload completion) through distribution and parallelization, which makes it suitable for long and complex analyses. By closely managing the diversity of resources available to a company or business unit, and by understanding algorithm dependencies and requirements, businesses get insight into their data faster and at lower cost, while keeping control of the execution. Multiple languages are supported, including R and Python, which are the most common languages used to extract deeper information from the data.

ProActive has also been used as a meta-scheduler and orchestrator for advanced architectures that have to balance security rules, fast processing, information accessibility, governance, and third-party software interactions. It provides the ability to optimize data transfers through advanced workflows and resource selection, including through Layer 3 networks and firewalls (Fig. 4). Its global view of the architecture enables load balancing and secure synchronization of multiple processes.

Finally, the Open Source approach followed by ActiveEon means greater flexibility which eases integration with existing IT architectures.


ProActive in Practice

As a real-life example, Figure 8 represents in grey the unused resources over time with ProActive.

Fig 8: Optimized Resource Consumption

Each color represents a different task consuming different types of resources. "The net effect of ProActive allowed for 10% overall savings in runtime and grid resources and for the high priority risk reports to be made available to customers over 3x faster, from 16 Hours to 5 Hours!" said the Lead Integration Engineer at Legal & General. You can check out the customer testimonial here.

Conclusion

The Big Data ecosystem is becoming more and more complex through additional storage tools, analytical tools, BI tools, etc. It is becoming clear that there is a strong need for a comprehensive Orchestrator in the enterprise data ecosystem, covering everything from data collection to data-driven decisions, using industrialized business processes.

Most of the tools classically used in the enterprise for Data management, such as Oracle and JD Edwards (ERP), SAS, IBM Cognos and TIBCO Spotfire (Analytics), IBM Maximo (factory maintenance), the various ETLs, and the Hadoop ecosystem itself, fall short of providing planning and orchestration that spans all applications.

A comprehensive Orchestrator must fully manage and handle all-inclusive Workflows; it must have the capacity to pilot and synchronize the executions (Meta-Scheduling) of the many tools of the data chain (from data acquisition to data storage to advanced analytics); and finally it must be able to provision and manage IT resources, both for itself and for the other tools.

With these three fundamental features, the Orchestrator provides a central place of control for automation, error recovery and resource management. The net benefits, besides the industrialization of Big Data exploitation, are big savings on execution time, DevOps time, manpower and IT resources, while gaining on availability.