How can Hadoop & SAP be integrated


Page 1: How can Hadoop & SAP be integrated

Running together in Retail Environment

Author: Douglas Bernardini

Page 2: How can Hadoop & SAP be integrated

Big Data Platform


Page 3: How can Hadoop & SAP be integrated

Big Data Platform


Collection of Hadoop & Apache solutions running together and integrated

Open-source: Apache Software Foundation.

Works across component technologies and integrates with pre-existing EDW, RDBMS and MPP systems.

Linux and Windows.

Authentication, Authorization, & Data Protection.

Native integration with major BI/analytics developers & vendors.

HDP Platform overview (diagram):

Data integration: real-time ingest (Flume, Storm); batch integration (Sqoop).

Data management: processing (YARN); storage (HDFS).

Data access: script/ETL (Pig); process (MapReduce); SQL-like (Hive); online (HBase); in-memory (Spark).

Hortonworks Data Platform (HDP)

Page 4: How can Hadoop & SAP be integrated

Big Data Platform

Scalable: stores and distributes very large data sets across hundreds of servers operating in parallel, scaling to thousands of nodes and thousands of terabytes of data.

Cost effective: offers computing and storage capacity for hundreds of dollars per terabyte, a staggering saving over traditional systems.

Flexible: can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, market campaign analysis, and fraud detection.

Fast: able to efficiently process terabytes of data in just minutes, and petabytes in hours.

Resilient to failure: data on an individual node is also replicated to other nodes in the cluster, so processing can continue in the event of a node failure.

Hadoop Technology Advantages & Profile

Source: external. In almost all cases the data comes from outside the corporation, e.g. social networks or suppliers.

Size: big. Normally tens to hundreds of terabytes, up to petabyte scale.

Structure: unstructured. Data is not separated into columns/rows and carries no schema.

Page 5: How can Hadoop & SAP be integrated

Data Management

Stores data across several clusters and servers

NameNode and DataNodes

Large volume: 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks.

Minimal data motion: Hadoop moves compute processes to the data on HDFS, not the other way around; moving computation is cheaper than moving data.

Dynamic diagnosis: diagnoses the health of the file system and rebalances the data across nodes.

Rollback: allows operators to bring back the previous version of HDFS after an upgrade.

Node redundancy: supports high availability (HA).

Storage

Hadoop Distributed File System (HDFS)
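To make this concrete, here is a minimal Python sketch (not from the slides) of writing to and reading from HDFS via the pyarrow client; the host, port, and path are illustrative placeholders:

    # Minimal sketch: write a file to HDFS and read it back with pyarrow.
    # Assumes a reachable cluster and a local libhdfs installation.
    from pyarrow import fs

    # Connect to the NameNode (hypothetical host/port).
    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

    # Write a small file; HDFS replicates its blocks across DataNodes.
    with hdfs.open_output_stream("/retail/pos/sample.csv") as f:
        f.write(b"store_id,sku,amount\n001,98765,19.90\n")

    # Read it back.
    with hdfs.open_input_stream("/retail/pos/sample.csv") as f:
        print(f.read().decode())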

Page 6: How can Hadoop & SAP be integrated

Data Management

Manages cluster resources and processing on top of HDFS

Multi-tenancy: multiple access engines use Hadoop as the common standard for batch, interactive, and real-time workloads, simultaneously accessing the same data set.

Cluster utilization: dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early Hadoop versions.

Scalability: data-center processing power continues to expand rapidly; the YARN ResourceManager schedules clusters of thousands of nodes managing petabytes of data.

Processing

Hadoop YARN

Page 7: How can Hadoop & SAP be integrated

Data Access

Processes data

The Map function: the InputFormat divides input data into ranges (splits) and a map task is created for each split in the input.

The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into groups of key-value pairs for each reducer.

The Reduce function: collects the various partial results and combines them to answer the larger problem the master node needs to solve. Each reducer collects the data from all of the maps for its keys and combines them to solve the problem.

Batch: MapReduce

Unstructured data

MapReduce data analysis
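As an illustration (not part of the original deck), the word-count pattern below simulates the map, shuffle, and reduce phases in plain Python; on a real cluster Hadoop would shard the input, run many map tasks, shuffle the key-value pairs by key, and run a reduce per key:

    # Minimal word-count sketch of the map/reduce idea in plain Python.
    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):
        # Map: emit (key, value) pairs for one input record.
        for word in line.split():
            yield (word.lower(), 1)

    def reduce_fn(key, values):
        # Reduce: combine all values collected for one key.
        return (key, sum(values))

    lines = ["SAP HANA and Hadoop", "Hadoop stores data", "HANA analyzes data"]

    # Simulate the shuffle phase: sort all pairs by key, then group.
    pairs = sorted(p for line in lines for p in map_fn(line))
    for key, group in groupby(pairs, key=itemgetter(0)):
        print(reduce_fn(key, (v for _, v in group)))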

Page 8: How can Hadoop & SAP be integrated

Data Integration & Governance

High-volume data ingestion

Stream data: ingests streaming data from multiple sources into Hadoop for storage and analysis.

Guaranteed data delivery: channel-based transactions guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started: one on the agent that delivers the event and one on the agent that receives it. This ensures guaranteed-delivery semantics.

Scales horizontally: new data streams and additional volume can be ingested by adding agents.

Real-time Ingest

Apache FLUME

(Diagram: unstructured data flows from agent nodes through collector nodes into the HDFS storage area.)
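For illustration, a Flume agent is wired together as source → channel → sink in a properties file. The sketch below uses made-up names and paths, held in a Python string only to keep all examples in one language; in practice the text lives in a .properties file passed to flume-ng:

    # Illustrative Flume agent definition (hypothetical names/paths).
    flume_conf = """
    agent1.sources  = weblog-src
    agent1.channels = mem-ch
    agent1.sinks    = hdfs-sink

    agent1.sources.weblog-src.type = exec
    agent1.sources.weblog-src.command = tail -F /var/log/httpd/access_log
    agent1.sources.weblog-src.channels = mem-ch

    agent1.channels.mem-ch.type = memory

    agent1.sinks.hdfs-sink.type = hdfs
    agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/retail/weblogs
    agent1.sinks.hdfs-sink.channel = mem-ch
    """
    print(flume_conf)  # run with: flume-ng agent -n agent1 -f flume.conf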

Page 9: How can Hadoop & SAP be integrated

Data Integration & Governance

Very-large-volume data ingestion

Fast: benchmarked at a million-plus messages/records per second per node.

Scalable: parallel computations run across a cluster of machines.

Fault-tolerant: when workers die, Storm automatically restarts them; if a node dies, its workers are restarted on another node.

Reliable: Storm guarantees that each unit of data (tuple) is processed at least once or exactly once; messages are replayed only when there are failures.

Real-time Ingest

Apache STORM

Page 10: How can Hadoop & SAP be integrated

Data Integration & Governance

Connects to traditional RDBMSs

Data imports: moves data from external stores and EDWs into Hadoop to optimize the cost-effectiveness of combined data storage and processing.

Improvements: compression and indexing for query performance.

Parallel data transfer: for faster performance and optimal system utilization.

Fast data copies: from external systems into Hadoop.

Load balancing: mitigates excessive storage and processing loads on other systems.

Batch Integration

Apache SQOOP

Efficient data analysis: improves the efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake.
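A typical import, sketched from Python for illustration; the JDBC URL, credentials, and table are hypothetical, while the flags shown are standard Sqoop import options:

    # Hypothetical Sqoop batch import driven from Python.
    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://erp-db.example.com/sales",
        "--username", "etl_user",
        "--password-file", "/user/etl/.pw",
        "--table", "POS_TRANSACTIONS",   # source RDBMS table
        "--target-dir", "/retail/pos",   # destination in HDFS
        "--num-mappers", "4",            # parallel data transfer
    ], check=True)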

Page 11: How can Hadoop & SAP be integrated

Data Access

Easy programming language

Easily programmed: complex tasks involving interrelated data transformations can be simplified and encoded as data-flow sequences. Pig programs accomplish huge tasks, yet they are easy to write and maintain.

Iterative data processing: extract-transform-load (ETL) data pipelines and research tools on raw data.

Extensible: Pig users can create custom functions to meet their particular processing requirements.

Self-optimizing: because the system automatically optimizes the execution of Pig jobs, the user can focus on semantics.

Script: Pig
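As a hypothetical illustration, the Pig Latin flow below loads raw web logs, groups them by user, and stores per-user hit counts; identifiers and paths are made up, and the script is held in a Python string only to keep all examples in one language:

    # Illustrative Pig Latin ETL flow (hypothetical relation/path names).
    pig_script = """
    logs   = LOAD '/retail/weblogs' USING PigStorage('\\t')
             AS (user_id:chararray, url:chararray, ts:long);
    by_usr = GROUP logs BY user_id;
    counts = FOREACH by_usr GENERATE group AS user_id, COUNT(logs) AS hits;
    top    = ORDER counts BY hits DESC;
    STORE top INTO '/retail/weblog_hits' USING PigStorage(',');
    """
    print(pig_script)  # in practice: save as a .pig file and submit with `pig`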

Page 12: How can Hadoop & SAP be integrated

Data Access

SQL-like tools

Familiar: query data with a SQL-based language. Hive tables are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units.

Fast: interactive response times, even over huge datasets.

Partitioned: each table can be subdivided into partitions that determine how data is distributed within subdirectories of the table directory.

Scalable and extensible: as data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance.

Uses JobTracker (MapReduce) functionality to execute queries.

SQL: Hive
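A minimal sketch of querying Hive from Python with the PyHive client; the host, database, table, and partition column are placeholders:

    # Minimal sketch: HiveQL over a partitioned table via PyHive.
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2.example.com", port=10000,
                           database="retail")
    cur = conn.cursor()

    # HiveQL looks like SQL; filtering on the partition column prunes
    # whole subdirectories of the table directory.
    cur.execute("""
        SELECT store_id, SUM(amount) AS revenue
        FROM pos_transactions
        WHERE sale_date = '2017-04-16'
        GROUP BY store_id
    """)
    for store_id, revenue in cur.fetchall():
        print(store_id, revenue)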

Page 13: How can Hadoop & SAP be integrated

Data Access

NoSQL tool with a SQL-like command interface

Apache HBase is an open-source NoSQL database that provides real-time read/write access to large datasets.

Scales linearly to handle huge data sets with billions of rows and millions of columns.

Easily combines data sources that use a wide variety of structures and schemas.

Natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.

A natural choice for storing semi-structured data such as log data.

Online: HBase
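A minimal sketch of real-time reads and writes using the happybase Thrift client; the table, row keys, and column names are made up:

    # Minimal sketch: real-time read/write against HBase via happybase.
    import happybase

    conn = happybase.Connection("hbase-thrift.example.com")
    table = conn.table("clickstream")

    # Write: a row key plus columns grouped under column family 'd'.
    table.put(b"user42|2017-04-16T10:00", {
        b"d:url": b"/checkout",
        b"d:status": b"200",
    })

    # Read a single row back by key.
    print(table.row(b"user42|2017-04-16T10:00"))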

Page 14: How can Hadoop & SAP be integrated

Data Access

Brings fast, in-memory data processing to Hadoop.

Elegant and expressive development APIs in Scala, Java, R, and Python.

Allows data workers to efficiently execute streaming, machine learning, or SQL workloads that need fast, iterative access to datasets.

Designed for data science; its abstractions make data science easier. Data scientists commonly use machine learning, a set of techniques and algorithms that learn from data, and these algorithms are often iterative.

In-memory: Spark
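A minimal PySpark sketch of the in-memory pattern that suits iterative workloads: cache a dataset once, then reuse it across passes (the input path is a placeholder):

    # Minimal sketch: cache a dataset in memory and reuse it across passes.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("retail-demo").getOrCreate()

    logs = spark.read.text("hdfs:///retail/weblogs").cache()  # keep in memory

    # First pass over the cached data.
    print("total lines:", logs.count())

    # Second pass reuses the in-memory copy instead of re-reading HDFS.
    checkout = logs.filter(logs.value.contains("/checkout"))
    print("checkout hits:", checkout.count())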

Page 15: How can Hadoop & SAP be integrated

ERP/DW-BI Platform


Page 16: How can Hadoop & SAP be integrated

Fast in-memory database

Traditional DBMS features: SQL interface, transactional isolation and recovery (ACID).

Parallel data-flow model: calculations can be executed in parallel, distributed across hosts.

Latest-generation data storage: columnar and row-based stores; indexes nearly eliminated; high data compression.

Automatic recovery: recovers from memory errors without a system reboot.

Native tools: Predictive Analysis Library, analytical and special-purpose interfaces.

SAP HANA Architecture (ERP/DW-BI Platform)
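To make the SQL interface concrete, here is a minimal sketch of connecting from Python with SAP's hdbcli driver; the host, credentials, schema, and table are placeholders:

    # Minimal sketch: query SAP HANA via the hdbcli DB-API client.
    from hdbcli import dbapi

    conn = dbapi.connect(address="hana.example.com", port=30015,
                         user="ANALYST", password="***")
    cur = conn.cursor()

    # Columnar, in-memory storage makes aggregate scans like this fast.
    cur.execute("""
        SELECT STORE_ID, SUM(AMOUNT)
        FROM SALES.POS_TRANSACTIONS
        GROUP BY STORE_ID
    """)
    print(cur.fetchall())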

Page 17: How can Hadoop & SAP be integrated

100x faster

Optimization: InfoCubes and DataStore Objects (DSOs) with better performance.

Faster remodeling: improved, lean data models; simplified data modeling and fewer materialized layers.

Data marts: integrated and embedded flexibility; OLAP and OLTP can be executed in one system.

Increased flexibility: optimized Layered Scalable Architecture; aggregates and cubes are no longer required (optional).

Improved response times: for existing transactions and entire business processes, through the general performance improvement of the underlying HANA database.

SAP HANA Technology Advantages & Profile

Size: not considered big for the Web 2.0 era. Tens of terabytes, not reaching petabytes.

Structure: structured. Data is separated into columns/rows and has a schema.

Source: internal. In almost all cases the data comes from INSIDE the corporation, from ERP/CRM/SCM.

ERP/DW-BI Platform

Page 18: How can Hadoop & SAP be integrated

SAP/Hana Evolution

Starting point: the SAP landscape consists of SAP ERP running on a relational database, connected to an OLAP engine (e.g. SAP BI) and perhaps using a business intelligence accelerator such as BOBJ.

(Diagram: OLTP SAP/ECC → ETL → OLAP SAP/BW → Analytics SAP/BOBJ.)

ERP/DW-BI Platform

Introducing HANA in parallel: install and run the in-memory engine (HANA) TOGETHER with the traditional SAP instances.

Two BW extractors run at the same time, exporting the same data.

Key factor: a real-world COMPARISON of data-processing performance.

(Diagram: the existing OLTP SAP/ECC → ETL → OLAP SAP/BW → Analytics SAP/BOBJ chain keeps running, while a second ETL feeds SAP/HANA with its own SAP/BOBJ analytics on top.)

Page 19: How can Hadoop & SAP be integrated

SAP/Hana Evolution

BW database upgrade: Re-created traditional-style BI in memory

ERP/DW-BI Platform

(Diagram: OLTP SAP/ECC → ETL → OLAP SAP/BW running on SAP/HANA → Analytics SAP/BOBJ.)

ERP/BI full database upgrade: eliminate the traditional database and run both instances in-memory, using non-materialized views.

(Diagram: OLTP SAP/ECC and OLAP BI 2.0 both run on SAP/HANA, with SAP/BOBJ analytics.)

Page 20: How can Hadoop & SAP be integrated

Sizing on SAP HANA (ERP/DW-BI Platform)

• Memory

• Traditional sizing is driven by CPU performance; SAP HANA sizing is driven by memory.

• Master/transactional data volume drives the main memory requirement.

• Main memory is required for: storing the business data; temporary memory space; supporting complex queries; buffers & caches.

• CPU

• Behaves differently with SAP HANA compared to traditional databases.

• Queries: complex, and executed at maximum speed.

• Disk size

• Disk storage space is still required:

• To preserve database information if the system shuts down (either intentionally or due to a power loss).

• Data changes are periodically copied to disk, ensuring a full image of the business data on disk.

• A logging mechanism enables system recovery.
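As a back-of-envelope illustration only (not an official SAP sizing), a commonly cited rule of thumb is main memory ≈ (source data / compression factor) × 2, where the factor 2 covers temporary space, buffers, and caches:

    # Illustrative only; the compression factor is an assumption, and real
    # sizing should follow SAP's official sizing tools and reports.
    source_data_tb = 21   # structured data volume from the business case
    compression = 4       # assumed columnar compression factor
    work_overhead = 2     # temporary space, buffers, caches

    ram_tb = source_data_tb / compression * work_overhead
    print(f"Estimated HANA main memory: ~{ram_tb:.1f} TB")  # ~10.5 TB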

Page 21: How can Hadoop & SAP be integrated

SAP HANA on VM (ERP/DW-BI Platform)

• SAP HANA on vSphere is fully supported.

• Combining SAP HANA and vSphere provides additional benefits with regard to deployment and availability.

• Some customer slots remain for SAP-on-SAP-HANA controlled-availability proofs of concept.

• SAP HANA > BlueMedora plug-in:

• Monitor memory and vCPU utilization.

• Add/delete resources: if underutilized, deploy more SAP HANA; if over-utilized, release SAP HANA resources.

• Workload management.

• Determine consolidation ratios.

Page 22: How can Hadoop & SAP be integrated

• Amazon AWS: SAP partner.

• SAP BW on HANA trial - PoC.

• The AWS server provides a HANA instance, ready to go in 30 minutes.

• OLAP: BW on HANA, or any other data-warehouse application with predominantly OLAP workloads, including data marts running many complex queries.

• OLTP: any transactional application, such as Business Suite on HANA, predominantly running simple queries or CRUD operations.

ERP/DW-BI Platform

SAP HANA on Cloud

Page 23: How can Hadoop & SAP be integrated

• Storage replication: the storage itself replicates all data to another location, within one or between several data centers. The technology is hardware-vendor-specific, and multiple concepts are available on the market.

• System replication: SAP HANA replicates all data to another location, within one or between several data centers. The technology is independent of hardware-vendor concepts and reusable with a changing infrastructure.

ERP/DW-BI Platform

Disaster Recovery

Page 24: How can Hadoop & SAP be integrated

Host Auto-Failover

• Standby mode: no data, requests, or queries.

• When an active (worker) host fails, a standby host automatically takes its place.

• Since the standby host can take over operations from any of the primary hosts, it needs access to all of the database volumes.

• Once repaired, the failed host can be rejoined to the system as the new standby host, re-establishing the failure-recovery capability.

ERP/DW-BI Platform

SAP/HANA High Availability

Page 25: How can Hadoop & SAP be integrated

ERP/DW-BI & Big Data Platform


Page 26: How can Hadoop & SAP be integrated

ERP/DW-BI & Big Data Platform

Architecture proposal (diagram):

Unstructured data → Hortonworks Data Platform: real-time ingest (Flume, Storm) and batch integration (Sqoop); processing (YARN) over storage (HDFS); data access via script/ETL (Pig), process (MapReduce), SQL-like (Hive), online (HBase), and in-memory (Spark).

Structured data (ERP/CRM/SCM) → SAP HANA: data repositories, OLAP engine, predictive engine, spatial engine, application logic & rendering.

Both platforms feed the Analytics layer.

Architecture Proposal
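One concrete way to wire the two sides together (not named on the slide) is SAP HANA Smart Data Access, which can expose a Hive table as a HANA virtual table so HANA SQL joins in-memory ERP data with data that stays in Hadoop. The sketch below is indicative only: the remote-source name, DSN, schema, and tables are placeholders, and the exact SDA syntax should be checked against the HANA documentation for your release:

    # Indicative sketch: Smart Data Access from HANA to Hive via hdbcli.
    from hdbcli import dbapi

    conn = dbapi.connect(address="hana.example.com", port=30015,
                         user="ADMIN", password="***")
    cur = conn.cursor()

    # Register Hive as a remote source via its ODBC DSN (hypothetical DSN).
    cur.execute("""CREATE REMOTE SOURCE "HADOOP_SRC" ADAPTER "hiveodbc"
                   CONFIGURATION 'DSN=HIVE_DSN'
                   WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=***'""")

    # Map a Hive table into HANA as a virtual table ...
    cur.execute("""CREATE VIRTUAL TABLE "RETAIL"."V_WEBLOGS"
                   AT "HADOOP_SRC"."HIVE"."default"."weblogs" """)

    # ... and join structured HANA data with Hadoop-resident data.
    cur.execute("""SELECT p.STORE_ID, SUM(p.AMOUNT), COUNT(w.URL)
                   FROM RETAIL.POS_TRANSACTIONS p
                   JOIN RETAIL.V_WEBLOGS w ON w.STORE_ID = p.STORE_ID
                   GROUP BY p.STORE_ID""")
    print(cur.fetchall())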

Page 27: How can Hadoop & SAP be integrated

ERP/DW-BI & Big Data Platform

Business Case: CRM/Retail

Internal structured data sources

• Point-of-sale data: captured when the customer makes purchases either in-store or on the company's e-commerce site. (4 TB)

• Inventory and stock information: which products are in stock at which locations/promotions. (7 TB)

• CRM data: from all the interactions the customer has had with the company at the support site. (8 TB)

• Total data size: 21 TB

External unstructured data sources

• Social media data: sentiment analysis of the customer's social media, such as Facebook. (70 TB)

• Historical web log information: a record of the customer's past browsing behavior on the company's web site. (30 TB)

• Geographic customer behavior: origin/destination of potential customers near stores. (20 TB)

• Total data size: 120 TB

Page 28: How can Hadoop & SAP be integrated

ERP/DW-BI & Big Data Platform


Business Case: Data Process

Page 29: How can Hadoop & SAP be integrated

[email protected]

Questions?
