sap hana and apache hadoop for big data management (sf scalable systems meetup)

30
Will Gardella, Senior Director SAP Applied Research - Big Data Program [email protected] HANA & Hadoop for Big Data Management

Upload: will-gardella

Post on 10-May-2015

7.000 views

Category:

Technology


2 download

DESCRIPTION

In this presentation I argue that the future of data management may see a split between (1) real-time in-memory systems such as SAP HANA for most enterprise workloads (2) disk-based free and open-source Apache Hadoop for certain specialized big data uses. The presentation starts with a definition of what is intended by the term big data, then talks about SAP HANA and Apache Hadoop from the perspective of suitability for enterprise use with a special concentration on Hadoop. (The basics of SAP HANA were covered in the immediately preceding session). This is followed by a description of currently available SAP support for Apache Hadoop in SAP BI 4.0 and SAP Data Services / EIM. Due to time constraints I did not discuss Apache Hadoop support built into Sybase IQ.

TRANSCRIPT

Page 1: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

Will Gardella, Senior Director

SAP Applied Research - Big Data Program

[email protected]

HANA & Hadoop for

Big Data Management

Page 2: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 2

Safe Harbor Statement

The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation and SAP's strategy and possible future developments, products and or platforms directions and functionality are all subject to change and may be changed by SAP at any time for any reason without notice. The information on this document is not a commitment, promise or legal obligation to deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent. All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.

Page 3: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

SAP REAL-TIME DATA PLATFORM

A GAME-CHANGER

Information Management & Real-Time Data Movement

Transactional

Data

Management

In-Memory

Data

Management

Analytics

EDW Data

Management

Mobile Data

Management

Com

mon D

esig

n &

Modelli

ng

Environm

ent

Com

mon L

andscape

Enviro

nm

ent

Open APIs and Protocols

SAP real time data platform

Federated Access

Page 4: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 4

Traditional data management approaches are changing

?

100101 011010 100101

1980s / 1990s Today

Page 5: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 5

What is Big Data? The 3 + 1 V’s

Explosion in the

amount of data

Fast collection, processing

and consumption

Multiple data formats;

non-structured data boom

Velocity

Variety

Volume

Data

Vo

lum

e in

Wh

ole

Wo

rld

Customer

Data

Automobiles

Machine Data

Smart Meter

7.9 Zettabytes

! Point of Sale

Mobile

Structured Data

Click Stream

Social

Network

Location-

based Data

Text Data

IMHO, it’s great!

RFID

Future 2015 2011

1.8

Zettabytes Value Keep everything, not only

high value data

Page 6: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 6

HANA for Big Data

Key Characteristics: in-memory, row & column, real time

How’s HANA for Big Data?

Volume: Billions of records

Variety: Text processing & search

Velocity: Real time

Value: High value data

Page 7: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 7

Data management today: systems optimize for speed or capacity S

peed

Size

Near 0 marginal

storage cost

Most enterprise systems

today not too large for

real time

Page 8: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 8

New storage and processing techniques required

Columnar

Distributed

In-memory

Row

Real-time queries

High value data

Targeted data read

Batch queries

Flexible data sets

All data read

Page 9: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 9

Information Views

EDW / Data Marts

Data Mining /

Predictive Analysis

Analytic Data Warehouse Real-time

Database

Insight Discovery Real-time Operations

Business

Applications & Processes

Analytic Tools, Custom Data

Analysis Applications BI Tools

Business Intelligence

Text Analysis Real-time Loading

Building an IT landscape for Big Data

ETL, Data Quality

Present

Process

Store

Ingest

Page 10: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

Hadoop

Page 11: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 11

What is Apache Hadoop?

Apache Hadoop is open source software that enables reliable, scalable, distributed computing on

clusters of inexpensive servers

Reliable

Software is fault tolerant, it expects and handles hardware and software failures

Scalable

Designed for massive scale of processors, memory, and local attached storage

Distributed

Handles replication. Offers massively parallel programming model, MapReduce

Hadoop framework handles: partitioning, scheduling, dispatch, execution, communication, failure

handling, monitoring, reporting and more

Page 12: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 12

The Apache Hadoop technology family logical view*

Hive “Data warehouse” that provides SQL

interface. Data structure is projected ad

hoc onto unstructured underlying data

Pig Platform for manipulating and

analyzing large data sets.

Scripting language for analysts

HBase Column oriented, schema-less, distributed

database modeled after Google’s

BigTable. Random realtime read/write

Hadoop Common

MapReduce Distributes & monitors tasks, restarts

failed work

HDFS Distributes & replicates data across

machines

MapReduce Parallel programming

Large block data handling

(e.g. 64MB)

Non-Relational DB

Fine-grained data handling

Scripting

* For simplicity, mappings to servers is omitted

Mahout Machine learning libraries for

recommendations, clustering,

classification and itemsets

Machine Learning

Page 13: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 13

What does Hadoop bring to the table?

Batch Processing

Where fast response times are less critical than reliability and scalability

Complex Information Processing

Enable heavily recursive algorithms, machine learning, & queries that cannot be easily expressed in SQL

Low Value Data Archive

Data stays available, though access is slower

Post-hoc Analysis

Mine raw data that is either schema-less or where schema changes over time

Cost efficient data storage and processing for large volumes of structured, semi-structured, and

unstructured data such as web logs, machine data, text data, call data records (CDRs), audio, video data

Page 14: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 14

Example: Retail Point of Sales Demo Scenario

HANA

9 TB Web Logs

Hadoop

1.1 Billion POS Records

SAP BusinessObjects

Explorer

18 Million Visitor

Session Records

http://youtu.be/HmmPje38e1k

Page 15: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 15

Apache Hadoop bottom line

Strengths Weaknesses

+ Huge data volumes

+ Unstructured data

+ Reliable

+ Scalable

+ Lowest cost

+ Open source

+ No hardware lock in

+ Batch processing

- Not efficient at small scale

- Real time is best case challenging, typically not possible

- Requires skilled engineering, operation and analyst resources

- Hiring qualified talent

- Less mature than SQL

- Governance

- Lack of user role support in access model

Page 16: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 16

Information Views

EDW / Data Marts

Data Mining /

Predictive Analysis

Analytic Data Warehouse Real-time

Database

Insight Discovery Real-time Operations

Business

Applications & Processes

Analytic Tools, Custom Data

Analysis Applications BI Tools

Business Intelligence

Text Analysis Real-time Loading

Hadoop & Enterprise Information Management

ETL, Data Quality

Present

Process

Store

Ingest

A

Page 17: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 17

IT administrator: Extract, transform, and load data quickly

Any Source

Data Load

Metadata

Open Hub

SAP Data Integrator

Database Engine

Modeler

SAP HANA or

SAP Sybase IQ

BW

Repository

Server

Designer and

Management

Console

Page 18: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 18

Loading data from Hadoop into your database

Job

Process

HQL Generator

Hive Result

set

HDFS

FileReader

Process

Data

Database

Loader

ODBC/

JDBC

driver

SAP Data Services

1. Based on target, SAP Data Services

translates queries into:

o Hive Query Language (HQL) Hive

o Pig script HDFS

2. Hive/Pig converts queries to

Map/Reduce jobs

3. Result data files are generated on the

HDFS system

4. SAP Data Services use multiple

threads to process data from Hive/Pig

5. Optional transforms: Data quality

operations

6. Load results into database

1

2

3

4

5 6

Pig

Generator

HDFS Join tables, order /

filter data, apply

functions

Text data

processing

M/R

M/R

Page 19: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 19

SAP Data Services: Simple GUI build and run ETL process

Push down to Hadoop

through HiveSQL

or Pig Scripts

Bulk load

into EDW

Page 20: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 20

Processing text to extract relevant data from Hadoop

Use SAP Data Services to extract:

Core entities (who, what, when, where, etc.)

Domains (voice of customer, public sector, enterprise, etc)

Sentiment analysis (strong positive, weak positive,

neutral, weak negative, strong negative)

Perform transformations

Map text into pre-defined structures

Cleanse, match, de-duplicate data

Load results quickly into EDW

Map text to structure

1

2

3

Page 21: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 21

Information Views

EDW / Data Marts

Data Mining /

Predictive Analysis

Analytic Data Warehouse Real-time

Database

Insight Discovery Real-time Operations

Business

Applications & Processes

Analytic Tools, Custom Data

Analysis Applications BI Tools

Business Intelligence

Text Analysis Real-time Loading

Hadoop Analytics

ETL, Data Quality

Present

Process

Store

Ingest

B

Page 22: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 22

Business Analyst: Viewing data in Hadoop using GUI tools

Automatically generates HiveQL statements that are executed on a Hadoop cluster

Page 23: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 23

SAP BusinessObjects BI: Hadoop for Business Analysts

Common user experience for all front-end tools

Web Intelligence Crystal Reports Dashboards Explorer

Empower all analysts,

enable all workflows

All data sources

SAP BW HADOOP HIVE

3rd party databases

Web Services

Files SAP HANA Sybase databases

Extract, define, &

manipulate metadata

Best access method for each specific data source

Direct Access Universe Access High performance,

feature rich, secure

Page 24: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 24

Simple Tools for the BI administrator to define data access

Build a Data Foundation against a Hive schema

■ Draw joins between Hive tables, aliases, derived tables,

Hive views and Hive partitioned tables

Page 25: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 25

Data Scientist: Flexibility is of the essence

Chooses the best tool based on data mining technique

Chooses best analysis engine based on algorithm & data

100101 011010 100101

Chooses the variables that offer the most promise

Page 26: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 26

Data Scientist: Open integration is key

M/R

algorithms

In-database

Analytics In-database

Analytics

Page 27: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 27

Engage

Ingest

Process

Store

Information Views

EDW / Data Marts

Data Mining /

Predictive Analysis

Analytic Data Warehouse Real-time

Database

Insight Discovery Real-time Operations

Business

Applications & Processes

Analytic Tools, Custom Data

Analysis Applications BI Tools

Business Intelligence

Text Analysis Real-time Loading

Building an IT landscape for Big Data

ETL, Data Quality

Page 28: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

Questions?

[email protected]

Page 29: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 29

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express

permission of SAP AG. The information contained herein may be changed without prior notice.

Some software products marketed by SAP AG and its distributors contain proprietary software components of

other software vendors.

Microsoft, Windows, Excel, Outlook, PowerPoint, Silverlight, and Visual Studio are registered trademarks of

Microsoft Corporation.

IBM, DB2, DB2 Universal Database, System i, System i5, System p, System p5, System x, System z, System

z10, z10, z/VM, z/OS, OS/390, zEnterprise, PowerVM, Power Architecture, Power Systems, POWER7,

POWER6+, POWER6, POWER, PowerHA, pureScale, PowerPC, BladeCenter, System Storage, Storwize,

XIV, GPFS, HACMP, RETAIN, DB2 Connect, RACF, Redbooks, OS/2, AIX, Intelligent Miner, WebSphere,

Tivoli, Informix, and Smarter Planet are trademarks or registered trademarks of IBM Corporation.

Linux is the registered trademark of Linus Torvalds in the United States and other countries.

Adobe, the Adobe logo, Acrobat, PostScript, and Reader are trademarks or registered trademarks of Adobe

Systems Incorporated in the United States and other countries.

Oracle and Java are registered trademarks of Oracle and its affiliates.

UNIX, X/Open, OSF/1, and Motif are registered trademarks of the Open Group.

Citrix, ICA, Program Neighborhood, MetaFrame, WinFrame, VideoFrame, and MultiWin are trademarks or

registered trademarks of Citrix Systems Inc.

HTML, XML, XHTML, and W3C are trademarks or registered trademarks of W3C®, World Wide Web

Consortium, Massachusetts Institute of Technology.

Apple, App Store, iBooks, iPad, iPhone, iPhoto, iPod, iTunes, Multi-Touch, Objective-C, Retina, Safari, Siri,

and Xcode are trademarks or registered trademarks of Apple Inc.

IOS is a registered trademark of Cisco Systems Inc.

RIM, BlackBerry, BBM, BlackBerry Curve, BlackBerry Bold, BlackBerry Pearl, BlackBerry Torch, BlackBerry

Storm, BlackBerry Storm2, BlackBerry PlayBook, and BlackBerry App World are trademarks or registered

trademarks of Research in Motion Limited.

© 2012 SAP AG. All rights reserved.

Google App Engine, Google Apps, Google Checkout, Google Data API, Google Maps, Google Mobile Ads,

Google Mobile Updater, Google Mobile, Google Store, Google Sync, Google Updater, Google Voice,

Google Mail, Gmail, YouTube, Dalvik and Android are trademarks or registered trademarks of Google Inc.

INTERMEC is a registered trademark of Intermec Technologies Corporation.

Wi-Fi is a registered trademark of Wi-Fi Alliance.

Bluetooth is a registered trademark of Bluetooth SIG Inc.

Motorola is a registered trademark of Motorola Trademark Holdings LLC.

Computop is a registered trademark of Computop Wirtschaftsinformatik GmbH.

SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP BusinessObjects Explorer, StreamWork,

SAP HANA, and other SAP products and services mentioned herein as well as their respective logos are

trademarks or registered trademarks of SAP AG in Germany and other countries.

Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Decisions, Web

Intelligence, Xcelsius, and other Business Objects products and services mentioned herein as well as their

respective logos are trademarks or registered trademarks of Business Objects Software Ltd. Business Objects

is an SAP company.

Sybase and Adaptive Server, iAnywhere, Sybase 365, SQL Anywhere, and other Sybase products and services

mentioned herein as well as their respective logos are trademarks or registered trademarks of Sybase Inc.

Sybase is an SAP company.

Crossgate, m@gic EDDY, B2B 360°, and B2B 360° Services are registered trademarks of Crossgate AG

in Germany and other countries. Crossgate is an SAP company.

All other product and service names mentioned are the trademarks of their respective companies. Data

contained in this document serves informational purposes only. National product specifications may vary.

The information in this document is proprietary to SAP. No part of this document may be reproduced, copied,

or transmitted in any form or for any purpose without the express prior written permission of SAP AG.

Page 30: SAP HANA and Apache Hadoop for Big Data Management (SF Scalable Systems Meetup)

© 2012 SAP AG. All rights reserved. 30

© 2012 SAP AG. Alle Rechte vorbehalten.

Weitergabe und Vervielfältigung dieser Publikation oder von Teilen daraus sind, zu welchem Zweck und in

welcher Form auch immer, ohne die ausdrückliche schriftliche Genehmigung durch SAP AG nicht gestattet.

In dieser Publikation enthaltene Informationen können ohne vorherige Ankündigung geändert werden.

Die von SAP AG oder deren Vertriebsfirmen angebotenen Softwareprodukte können Softwarekomponenten

auch anderer Softwarehersteller enthalten.

Microsoft, Windows, Excel, Outlook, und PowerPoint sind eingetragene Marken der Microsoft Corporation.

IBM, DB2, DB2 Universal Database, System i, System i5, System p, System p5, System x, System z, System

z10, z10, z/VM, z/OS, OS/390, zEnterprise, PowerVM, Power Architecture, Power Systems, POWER7,

POWER6+, POWER6, POWER, PowerHA, pureScale, PowerPC, BladeCenter, System Storage, Storwize,

XIV, GPFS, HACMP, RETAIN, DB2 Connect, RACF, Redbooks, OS/2, AIX, Intelligent Miner, WebSphere,

Tivoli, Informix und Smarter Planet sind Marken oder eingetragene Marken der IBM Corporation.

Linux ist eine eingetragene Marke von Linus Torvalds in den USA und anderen Ländern.

Adobe, das Adobe-Logo, Acrobat, PostScript und Reader sind Marken oder eingetragene Marken von

Adobe Systems Incorporated in den USA und/oder anderen Ländern.

Oracle und Java sind eingetragene Marken von Oracle und/oder ihrer Tochtergesellschaften.

UNIX, X/Open, OSF/1 und Motif sind eingetragene Marken der Open Group.

Citrix, ICA, Program Neighborhood, MetaFrame, WinFrame, VideoFrame und MultiWin sind Marken oder

eingetragene Marken von Citrix Systems, Inc.

HTML, XML, XHTML und W3C sind Marken oder eingetragene Marken des W3C®, World Wide Web

Consortium, Massachusetts Institute of Technology.

Apple, App Store, iBooks, iPad, iPhone, iPhoto, iPod, iTunes, Multi-Touch, Objective-C, Retina, Safari, Siri

und Xcode sind Marken oder eingetragene Marken der Apple Inc.

IOS ist eine eingetragene Marke von Cisco Systems Inc.

RIM, BlackBerry, BBM, BlackBerry Curve, BlackBerry Bold, BlackBerry Pearl, BlackBerry Torch, BlackBerry

Storm, BlackBerry Storm2, BlackBerry PlayBook und BlackBerry App World sind Marken oder eingetragene

Marken von Research in Motion Limited.

Google App Engine, Google Apps, Google Checkout, Google Data API, Google Maps, Google Mobile Ads,

Google Mobile Updater, Google Mobile, Google Store, Google Sync, Google Updater, Google Voice,

Google Mail, Gmail, YouTube, Dalvik und Android sind Marken oder eingetragene Marken von Google Inc.

INTERMEC ist eine eingetragene Marke der Intermec Technologies Corporation.

Wi-Fi ist eine eingetragene Marke der Wi-Fi Alliance.

Bluetooth ist eine eingetragene Marke von Bluetooth SIG Inc.

Motorola ist eine eingetragene Marke von Motorola Trademark Holdings, LLC.

Computop ist eine eingetragene Marke der Computop Wirtschaftsinformatik GmbH.

SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP BusinessObjects Explorer, StreamWork,

SAP HANA und weitere im Text erwähnte SAP-Produkte und Dienstleistungen sowie die entsprechenden

Logos sind Marken oder eingetragene Marken der SAP AG in Deutschland und anderen Ländern.

Business Objects und das Business-Objects-Logo, BusinessObjects, Crystal Reports, Crystal Decisions,

Web Intelligence, Xcelsius und andere im Text erwähnte Business-Objects-Produkte und Dienstleistungen

sowie die entsprechenden Logos sind Marken oder eingetragene Marken der Business Objects Software Ltd.

Business Objects ist ein Unternehmen der SAP AG.

Sybase und Adaptive Server, iAnywhere, Sybase 365, SQL Anywhere und weitere im Text erwähnte Sybase-

Produkte und -Dienstleistungen sowie die entsprechenden Logos sind Marken oder eingetragene Marken der

Sybase Inc. Sybase ist ein Unternehmen der SAP AG.

Crossgate, m@gic EDDY, B2B 360°, B2B 360° Services sind eingetragene Marken der Crossgate AG in

Deutschland und anderen Ländern. Crossgate ist ein Unternehmen der SAP AG.

Alle anderen Namen von Produkten und Dienstleistungen sind Marken der jeweiligen Firmen. Die Angaben im

Text sind unverbindlich und dienen lediglich zu Informationszwecken. Produkte können länderspezifische

Unterschiede aufweisen.

Die in dieser Publikation enthaltene Information ist Eigentum der SAP. Weitergabe und Vervielfältigung dieser

Publikation oder von Teilen daraus sind, zu welchem Zweck und in welcher Form auch immer, nur mit

ausdrücklicher schriftlicher Genehmigung durch SAP AG gestattet.