Hadoop as Storage for Aging Data

Page 1: Hadoop as Storage for Aging Data

© 2015 IBM Corporation

DDB-1560:Hadoop as Storage for Aging Data -Simplicity with Big SQL and DB2

Eugen Stoianovici ([email protected])

Software Developer, IBM

02:00 pm - 03:00 pm

Thursday, October 29th, 2015

Page 2: Hadoop as Storage for Aging Data

• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal

without notice at IBM’s sole discretion.

• Information regarding potential future products is intended to outline our general product direction

and it should not be relied on in making a purchasing decision.

• The information mentioned regarding potential future products is not a commitment, promise, or

legal obligation to deliver any material, code or functionality. Information about potential future

products may not be incorporated into any contract.

• The development, release, and timing of any future features or functionality described for our

products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a

controlled environment. The actual throughput or performance that any user will experience will vary

depending upon many factors, including considerations such as the amount of multiprogramming in the

user’s job stream, the I/O configuration, the storage configuration, and the workload processed.

Therefore, no assurance can be given that an individual user will achieve results similar to those stated

here.

Please Note:

2

Page 3: Hadoop as Storage for Aging Data

• Problem description – data aging

• Big SQL - SQL on Hadoop for an online queryable archive

• Data movement tools with Big Insights

• Big SQL – Tips for maximizing usability

Agenda

2

Page 4: Hadoop as Storage for Aging Data

Modern Warehouse

[Architecture diagram: all data sources flow through information ingestion and operational information into three zones: a real-time analytic zone (stream processing, data integration, master data, Streams), a landing area / analytics zone and archive, and the exploration, integrated warehouse, and mart zones. Analytic applications sit on top (BI and predictive analytics, decision management, discovery, intelligence analysis, entity analytics, machine learning, text analytics, and data mining over raw data, structured data, video/audio, and network/sensor data), with information governance, security, and business continuity underpinning the whole stack.]

Page 5: Hadoop as Storage for Aging Data

IBM BigInsights for Hadoop

[Same architecture diagram as the previous slide, now highlighting IBM BigInsights for Hadoop as the landing area, analytics zone, and archive that sits alongside the exploration, integrated warehouse, and mart zones.]

Page 6: Hadoop as Storage for Aging Data

• Create a large area of storage for cold data

• Immediate cost savings, freeing EDW capacity

Placing the lowest-level detailed schemas in BigInsights / Hadoop

Offloading ETL into a cheaper, more elastic MPP environment

Removing the cost of HA/DR from this data

• Data remains online, instead of nearline or on tape

More detail and a greater variety of data

Retain longer history of detailed data

Queryable Archive

5

Page 7: Hadoop as Storage for Aging Data

• Hadoop is not a single piece of software; you can't just install "Hadoop"

• It is an ecosystem of software components that work together

Hadoop Core (APIs)

HDFS (File system)

MapReduce (Data processing)

Hive (SQL access)

HBase (NoSQL database)

Spark (Compute engine)

Sqoop (Data movement)

Oozie (Job workflow)

... There is a LOT of "Hadoop" software

• Open Data Platform Initiative: facilitates adoption of Hadoop

What is Hadoop?

6

Page 8: Hadoop as Storage for Aging Data

Overview of BigInsights

IBM InfoSphere BigInsights is now IBM BigInsights for Apache Hadoop!

IBM Open Platform with Apache Hadoop
(HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper)

IBM BigInsights Analyst:
Industry-standard SQL (Big SQL)
Spreadsheet-style tool (BigSheets)

IBM BigInsights Data Scientist:
Text Analytics
Machine Learning on Big R
Big R (R support)

IBM BigInsights Enterprise Management:
POSIX distributed filesystem
Multi-workload, multi-tenant scheduling

Free Quick Start (non-production):

• IBM Open Platform

• BigInsights Analyst and Data Scientist features

• Community support

Page 9: Hadoop as Storage for Aging Data

Part II - Hadoop and SQL

Page 10: Hadoop as Storage for Aging Data

• Data in Hadoop is truly “schema-less”

• Hadoop is API-based at lower levels

Requires strong programming expertise

Steep learning curve

Simple operations can be tedious

• In most cases, data is structured

e.g., aged data

• SQL when possible

Already lots of expertise available

Separation of what you want vs. how to get it

Robust ecosystem of tools

SQL on Hadoop

9

Page 11: Hadoop as Storage for Aging Data

• Part of the value-add packages for BigInsights

• IBM’s SQL for Hadoop

Opens Hadoop data to a wider audience

• Complements the Data Warehouse

Exploratory analytics

Sandbox

Data Lake

• Use familiar SQL tools

Cognos

MicroStrategy

DSM

What is Big SQL?

Real-time Analytics: InfoSphere Streams

Data Governance and Security: Data Click, LDAP and secured cluster

Data Exploration: BigSheets "schema-on-read" tooling

Predictive Modeling: Big R scalable data mining on R

Text Analytics: Text processing with AQL

Application Tooling: Toolkits and accelerators

Page 12: Hadoop as Storage for Aging Data

• Key advantages of Big SQL

Performance

SQL “Richness”

Enterprise features & security

• How we’re doing it

Design principles

Architecture

Big SQL - topics

11

Page 13: Hadoop as Storage for Aging Data

Big SQL V4.1 vs Hive 1.2.1 @ 1TB TPC-DS*

12

*Not an official TPC-DS Benchmark.

Page 14: Hadoop as Storage for Aging Data

• In 99 of 99 queries, Big SQL was faster

• On average, Big SQL was 21X faster

• Excluding the top 5 and bottom 5 results, Big SQL was 19X faster

Big SQL V4.1 vs Hive 1.2.1 @ 1TB TPC-DS*

13

*Not an official TPC-DS Benchmark.

Page 15: Hadoop as Storage for Aging Data

14

Big SQL runs more SQL out-of-the-box

Porting effort: Big SQL 4.1 took 1 hour; Spark SQL 1.5.0 took 3-4 weeks. Big SQL can execute all 99 queries with minimal porting effort.

Single-stream results:

Big SQL was faster than Spark SQL on 76 of 99 queries

When Big SQL was slower, it was only slower by 1.6X on average

Query vs. query, Big SQL is on average 5.5X faster

Removing the top 5 / bottom 5, Big SQL is 2.5X faster

Page 16: Hadoop as Storage for Aging Data

• Leverage IBM’s rich SQL history and expertise

Modern SQL:2011 capabilities

DB2-compatible SQL PL support

• SQL bodied functions

• SQL bodied stored procedures

• Robust error handling

• Application logic/security encapsulation

• Flow of control logic

• Same SQL syntax used in the EDW works on Big SQL.

• Grouping sets with CUBE and ROLLUP

• Windowing and OLAP aggregates

Big SQL - SQL Capabilities

15
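As a sketch of what these capabilities look like in practice (table and column names are illustrative, borrowed from the partitioned demo.sales example later in the deck):

```sql
-- Grouping sets: one pass produces per-state/part subtotals,
-- per-state totals, and a grand total.
SELECT state, part_name, SUM(qty * cost) AS revenue
FROM demo.sales
GROUP BY ROLLUP (state, part_name);

-- OLAP windowing: rank parts by cost within each state.
SELECT state, part_name, cost,
       RANK() OVER (PARTITION BY state ORDER BY cost DESC) AS cost_rank
FROM demo.sales;
```

Because the syntax is the same as in the EDW, statements like these can move to the archive unchanged.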


Page 18: Hadoop as Storage for Aging Data

Hadoop-DS: Performance Test update: Big SQL V4.1 vs. Spark SQL 1.5.1 @ 1 TB, single stream*

17

*Not an official TPC-DS Benchmark.

Page 19: Hadoop as Storage for Aging Data

But, ... what happens when you scale it?

1 TB, single stream:
• Big SQL was faster on 76 of 99 queries
• Big SQL averaged 5.5X faster
• Removing the top / bottom 5, Big SQL averaged 2.5X faster

1 TB, 4 concurrent streams:
• Spark SQL FAILED on 3 queries
• Big SQL was 4.4X faster*

10 TB, single stream:
• Big SQL was faster on 80 of 99 queries
• Spark SQL FAILED on 7 queries
• Big SQL averaged 6.2X faster*
• Removing the top / bottom 5, Big SQL averaged 4.6X faster

10 TB, 4 concurrent streams:
• Big SQL's elapsed time for the workload scaled better than linearly
• Spark SQL could not complete the workload (numerous issues); partial results were possible with only 2 concurrent streams

More data, more users.

*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)

Page 20: Hadoop as Storage for Aging Data

What is the verdict? Use the right tool for the right job.

Big SQL: ideal tool for BI data analysts and production workloads
• Migrating existing workloads to Hadoop
• Security
• Many concurrent users
• Best-in-class performance

Spark SQL: ideal tool for data scientists and discovery
• Machine learning
• Simpler SQL
• Good performance

Page 21: Hadoop as Storage for Aging Data

• Modern security model

GRANT/REVOKE support

GROUP permissions

ROW level permissions

COLUMN level permissions

• Data Auditing tools

e.g., IBM InfoSphere Guardium

Big SQL enterprise grade features

20
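A hedged sketch of what this model looks like in DB2-style SQL (the group, user, and permission names are hypothetical; the row-permission statements follow DB2's row and column access control syntax, which is the model the slide refers to):

```sql
-- Table-level privileges for a group and a user
GRANT SELECT ON TABLE demo.sales TO GROUP analysts;
REVOKE INSERT ON TABLE demo.sales FROM USER intern1;

-- Row-level permission: managers see everything,
-- everyone else sees only NJ rows (illustrative rule)
CREATE PERMISSION nj_only ON demo.sales
  FOR ROWS WHERE state = 'NJ'
        OR VERIFY_GROUP_FOR_USER(SESSION_USER, 'MANAGERS') = 1
  ENFORCED FOR ALL ACCESS
  ENABLE;
ALTER TABLE demo.sales ACTIVATE ROW ACCESS CONTROL;
```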

Page 22: Hadoop as Storage for Aging Data

Big SQL Architecture

21

[Architecture diagram:
• Management node: Big SQL master node, together with the database service, Hive metastore, Hive server, and DDL and UDF fenced-mode processes
• Management node: Big SQL scheduler
• Compute nodes (repeated across the cluster): a Big SQL worker node with Java I/O, native I/O, and UDF fenced-mode processes; an HDFS data node; an MR task tracker; other services; HDFS data and temp data on local disks]

*FMP = fenced-mode process

Page 23: Hadoop as Storage for Aging Data

• Head node HA Support

• Easy to extend

Add data nodes for more storage

Add worker nodes for more processing power

• Fault tolerance

Only use online workers for computations

MPP – Flexible architecture

22

Page 24: Hadoop as Storage for Aging Data

Big SQL – data

23

CREATE HADOOP TABLE bigsql.sales (
  part_id INTEGER,
  part_name VARCHAR(32),
  quantity INTEGER,
  cost DECIMAL(10,4)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/bigsql/sales';

Sample file contents:

12321,apple,10,10.20
32881,banana,5,11.91
43321,salt,1,0.90
44552,sugar,1,0.80

The table does not actually "store" the data; it interprets the files in place.
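Because the table is only an interpretation of the files, a query reads the delimited data in place; for the sample rows above, a filter like this would return the apple and banana rows (the two with quantity above 1):

```sql
SELECT part_name, quantity, quantity * cost AS total_cost
FROM bigsql.sales
WHERE quantity > 1;
-- apple   10  102.00
-- banana   5   59.55
```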

Page 25: Hadoop as Storage for Aging Data

• Support for all major file formats

ORC

Parquet

Avro

Sequence

etc.

• Does not lock users in

Files can be used with external tools (Spark, MR, etc)

Big SQL – open storage

24

Page 26: Hadoop as Storage for Aging Data

Part III – Tools to offload data to Online Archive

Page 27: Hadoop as Storage for Aging Data

• Moves data from EDWs and external stores into HDFS

• Bulk import

Import individual tables or entire databases

• SQL import

Import result of SQL execution

• Data interaction

Support for interacting with data through Java classes

• Data export

Offload data from HDFS into relational databases

Simple with BigInsights - Sqoop

26

[Diagram: Sqoop moves data between the EDW and HDFS in the Hadoop cluster]

Page 28: Hadoop as Storage for Aging Data

• SQL statement executable in Big SQL

• Extracts data from external sources

RDBMS

EDW

HDFS

• Loads data into Big SQL tables

Understands Big SQL tables (e.g., columns, partitioning, etc.)

Understands Big SQL table properties (e.g., delimiters, etc.)

Simple with Big SQL - LOAD

27

CREATE HADOOP TABLE staff(
  id INT,
  name VARCHAR(64),
  job VARCHAR(10))
PARTITIONED BY (years INT);

LOAD HADOOP USING
  JDBC CONNECTION URL
    'jdbc:db2://edw:5000/SAMPLE'
  WITH PARAMETERS (
    'credential.file' =
      '/user/bigsql/mycreds'
  )
  FROM TABLE STAFF
  COLUMNS(ID, NAME, JOB, YEARS)
  WHERE 'YEARS is not null'
  WITH TARGET TABLE PROPERTIES(
    'delimited.drop.loadedlims'='true'
  )
  WITH LOAD PROPERTIES (
    'max.rejected.records'=50,
    'num.map.tasks' = 2);

Page 29: Hadoop as Storage for Aging Data

• Transparent

Acts as a single point of access for all data sources

• Extensible

Brings heterogeneous sources together

• High functionality

Use the same rich SQL against all sources

• High performance

Understands the capabilities of each source and pushes computation down

Simple with Big SQL – Federation

28
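Big SQL federation builds on DB2's federation layer; a hedged sketch of the usual DDL flow (the server name, credentials, and remote table are hypothetical, and the remote database is assumed to be already cataloged locally):

```sql
-- Register the remote EDW and map local credentials to it
CREATE SERVER edw TYPE DB2/UDB VERSION '10.5' WRAPPER DRDA
  OPTIONS (DBNAME 'SAMPLE');
CREATE USER MAPPING FOR USER SERVER edw
  OPTIONS (REMOTE_AUTHID 'dbuser', REMOTE_PASSWORD '********');

-- A nickname makes the remote table queryable like a local one
CREATE NICKNAME edw_sales FOR edw.APP.SALES;

-- Join hot EDW data with the cold Hadoop archive in one statement;
-- eligible work is pushed down to the EDW
SELECT e.part_id, e.qty, a.qty AS archived_qty
FROM edw_sales e
JOIN demo.sales a ON a.part_id = e.part_id;
```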

Page 30: Hadoop as Storage for Aging Data

Part IV – Tips for better SQL usability of your Online Archive

Page 31: Hadoop as Storage for Aging Data

• SQL is standard but internals are not

• Some things require more attention

Point queries

Normalization

DML (UPDATE/DELETE)

Tips for better performance

30

Page 32: Hadoop as Storage for Aging Data

• Table partitioning for data elimination

• Special file formats

ORC

Parquet

• HBase with secondary indexes

Point query optimization

31

Page 33: Hadoop as Storage for Aging Data

• Partitioning is defined on one or more columns

• Each unique value becomes a partition

• Query predicates are used to eliminate partitions

Big SQL table partitioning

32

CREATE TABLE demo.sales (
  part_id int,
  part_name string,
  qty int,
  cost double
)
PARTITIONED BY (
  state char(2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

Resulting layout under biginsights/hive/warehouse/demo.db/sales: each partition value gets its own subdirectory (state=NJ, state=AR, state=CA, state=NY), each holding its own data files (data1.csv, data2.csv, ...).

A query with a predicate on the partitioning column only reads the matching directories:

select * from demo.sales
where state in ('NJ', 'CA');
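Loading a partitioned table routes each row to its partition's directory automatically; a sketch assuming a hypothetical staging table with the same columns and dynamic partitioning on insert:

```sql
INSERT INTO demo.sales
SELECT part_id, part_name, qty, cost, state
FROM staging_sales;
-- Rows with state = 'NJ' land under .../sales/state=NJ/, and so on;
-- previously unseen state values create new partition directories.
```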

Page 34: Hadoop as Storage for Aging Data

• What they are:

Columnar file formats

Store statistics

• Pros

Efficient compression

Predicates are pushed down and evaluated at file level!

We skip files for which predicates don’t match

• Cons

Not easy to integrate outside of Hadoop

Big SQL special file formats (ORC, Parquet)

33

CREATE HADOOP TABLE orc_table
  (col1 int, col2 string, col3 bigint)
STORED AS ORC;

CREATE HADOOP TABLE parquet_table
  (col1 int, col2 string, col3 bigint)
STORED AS PARQUETFILE;
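A common use of these formats in an archive is converting an existing delimited table to a columnar one in a single statement; a hedged sketch, assuming CREATE ... AS SELECT is available for Hadoop tables (the target table name is illustrative):

```sql
CREATE HADOOP TABLE sales_parquet
STORED AS PARQUETFILE
AS SELECT * FROM bigsql.sales;
```

Point queries against sales_parquet can then benefit from the per-file statistics described above.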

Page 35: Hadoop as Storage for Aging Data

• Clustered Environment

Master and Region Servers

Automatic split (sharding)

• Key-value store

Key and value are byte arrays

Efficient access using row key

• HBase supports secondary

indexes!

HBase

34
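A hedged sketch of the approximate Big SQL DDL for an HBase-backed table: the COLUMN MAPPING clause ties SQL columns to the HBase row key and to column-family:qualifier pairs (all names here are illustrative):

```sql
CREATE HBASE TABLE archive.orders (
  order_id  INT,
  customer  VARCHAR(64),
  total     DECIMAL(10,2)
)
COLUMN MAPPING (
  KEY            MAPPED BY (order_id),
  cf_data:cust   MAPPED BY (customer),
  cf_data:total  MAPPED BY (total)
);
```

Point lookups on order_id then resolve through the row key rather than a scan.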

Page 36: Hadoop as Storage for Aging Data

[De]normalization – complex types

35

CREATE HADOOP TABLE people(
  id INT,
  name VARCHAR(64),
  phoneNo ARRAY<VARCHAR(32)>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|';

SELECT p.id, p.name, tel.no
FROM people p
UNNEST (p.phoneNo) tel(no);

Result:

ID | NAME              | Number
1  | Eugen Stoianovici | +353 8733 -----
2  | Eugen Stoianovici | +353 8700 -----
3  | John Smith        | +353 8732 -----
...

Page 37: Hadoop as Storage for Aging Data

• Fault tolerance to data (errors in semi-structured data do not cause queries to fail)

• Fault tolerance to hardware

Hardware maintenance can be batched and performed once a failure threshold is reached

Hardware failure does not impact the system

• Elastic clusters

Add more nodes when more capacity is required

Reassign nodes when batch job is done

Big SQL – elastic and fault tolerant

36

Page 38: Hadoop as Storage for Aging Data

Notices and Disclaimers

37

Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form

without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for

accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to

update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO

EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,

LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted

according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as

illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other

results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services

available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the

views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or

other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the

identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the

customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will

ensure that the customer is in compliance with any law.

Page 39: Hadoop as Storage for Aging Data

Notices and Disclaimers (con’t)

38

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly

available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,

compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the

suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to

interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT

LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,

trademarks or other intellectual property right.

• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document

Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM

SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,

OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,

pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,

Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of

International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be

trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:

www.ibm.com/legal/copytrade.shtml.

Page 40: Hadoop as Storage for Aging Data

Questions?

39

Page 41: Hadoop as Storage for Aging Data

• BigInsights – Hadoop open platform integration in the warehouse

• Big SQL architecture and compatibility with enterprise data warehouses

• Data movement tools

• Tips for Big SQL performance

Wrap up

40

Page 42: Hadoop as Storage for Aging Data

41

Get started with Big SQL: External resources

Hadoop Dev: links to videos, white papers, labs, ...

https://developer.ibm.com/hadoop/

Page 43: Hadoop as Storage for Aging Data

We Value Your Feedback!

Don’t forget to submit your Insight session and speaker

feedback! Your feedback is very important to us – we use it

to continually improve the conference.

Access your surveys at insight2015survey.com to quickly

submit your surveys from your smartphone, laptop or

conference kiosk.

42

Page 44: Hadoop as Storage for Aging Data

© 2015 IBM Corporation

Thank You