Hadoop as Storage for Aging Data

Page 1: Hadoop as Storage for Aging Data

© 2015 IBM Corporation

DDB-1560:Hadoop as Storage for Aging Data -Simplicity with Big SQL and DB2

Eugen Stoianovici ([email protected])

Software Developer, IBM

02:00 pm - 03:00 pm

Thursday, October 29th, 2015

Page 2: Hadoop as Storage for Aging Data

• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal

without notice at IBM’s sole discretion.

• Information regarding potential future products is intended to outline our general product direction

and it should not be relied on in making a purchasing decision.

• The information mentioned regarding potential future products is not a commitment, promise, or

legal obligation to deliver any material, code or functionality. Information about potential future

products may not be incorporated into any contract.

• The development, release, and timing of any future features or functionality described for our

products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a

controlled environment. The actual throughput or performance that any user will experience will vary

depending upon many factors, including considerations such as the amount of multiprogramming in the

user’s job stream, the I/O configuration, the storage configuration, and the workload processed.

Therefore, no assurance can be given that an individual user will achieve results similar to those stated

here.

Please Note:

2

Page 3: Hadoop as Storage for Aging Data

• Problem description – data aging

• Big SQL - SQL on Hadoop for an online queryable archive

• Data movement tools with Big Insights

• Big SQL – Tips for maximizing usability

Agenda

2

Page 4: Hadoop as Storage for Aging Data

Modern Warehouse

[Architecture diagram: all data sources flow through information ingestion and operational information into three zones: a real-time analytic zone (stream processing, data integration, master data, Streams), a landing area / analytics zone and archive, and the exploration, integrated warehouse, and mart zones. Analytic applications sit on top (BI and predictive analytics, decision management, discovery, intelligence analysis, entity analytics, machine learning, text analytics, and data mining over raw data, structured data, video/audio, and network/sensor data), with information governance, security, and business continuity underpinning the whole stack.]

Page 5: Hadoop as Storage for Aging Data

IBM BigInsights for Hadoop

[Same architecture diagram as the previous slide, now highlighting IBM BigInsights for Hadoop as the landing area, analytics zone, and archive that sits alongside the exploration, integrated warehouse, and mart zones.]

Page 6: Hadoop as Storage for Aging Data

• Create a large area of storage for cold data

• Immediate cost savings, freeing EDW capacity

Placing the lowest-level detailed schemas in BigInsights / Hadoop

Offloading ETL into a cheaper, more elastic MPP environment

Removing the cost of HA/DR from this data

• Data remains online, instead of nearline or on tape

More detail and a greater variety of data

Retain longer history of detailed data

Queryable Archive

5

Page 7: Hadoop as Storage for Aging Data

• Hadoop is not a single piece of software; you can't just install "Hadoop"

• It is an ecosystem of software components that work together

Hadoop Core (APIs)

HDFS (File system)

MapReduce (Data processing)

Hive (SQL access)

HBase (NoSQL database)

Spark (Compute engine)

Sqoop (Data movement)

Oozie (Job workflow)

... There is a LOT of "Hadoop" software

• Open Data Platform Initiative: facilitates adoption of Hadoop

What is Hadoop?

6

Page 8: Hadoop as Storage for Aging Data

Overview of BigInsights

IBM InfoSphere BigInsights is now IBM BigInsights for Apache Hadoop!

IBM Open Platform with Apache Hadoop
(HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper)

IBM BigInsights Analyst:
Industry-standard SQL (Big SQL)
Spreadsheet-style tool (BigSheets)

IBM BigInsights Data Scientist:
Text Analytics
Machine Learning on Big R
Big R (R support)

IBM BigInsights Enterprise Management:
POSIX distributed filesystem
Multi-workload, multi-tenant scheduling

Free Quick Start (non-production):

• IBM Open Platform

• BigInsights Analyst and Data Scientist features

• Community support

Page 9: Hadoop as Storage for Aging Data

Part II - Hadoop and SQL

Page 10: Hadoop as Storage for Aging Data

• Data in Hadoop is truly “schema-less”

• Hadoop is API-based at lower levels

Requires strong programming expertise

Steep learning curve

Simple operations can be tedious

• In most cases, data is structured

e.g., aged data

• SQL when possible

Already lots of expertise available

Separation of what you want vs. how to get it

Robust ecosystem of tools

SQL on Hadoop

9

Page 11: Hadoop as Storage for Aging Data

• Part of the value-add packages for BigInsights

• IBM’s SQL for Hadoop

Opens Hadoop data to a wider audience

• Complements the Data Warehouse

Exploratory analytics

Sandbox

Data Lake

• Use familiar SQL tools

Cognos

MicroStrategy

DSM

What is Big SQL?

Real-time Analytics: InfoSphere Streams

Data Governance and Security: Data Click, LDAP and secured cluster

Data Exploration: BigSheets "schema-on-read" tooling

Predictive Modeling: Big R scalable data mining on R

Text Analytics: Text processing with AQL

Application Tooling: Toolkits and accelerators

Page 12: Hadoop as Storage for Aging Data

• Key advantages of Big SQL

Performance

SQL “Richness”

Enterprise features & security

• How we’re doing it

Design principles

Architecture

Big SQL - topics

11

Page 13: Hadoop as Storage for Aging Data

Big SQL V4.1 vs Hive 1.2.1 @ 1TB TPC-DS*

12

*Not an official TPC-DS Benchmark.

Page 14: Hadoop as Storage for Aging Data

• In 99 of 99 queries, Big SQL was faster

• On average, Big SQL was 21X faster

• Excluding the top 5 and bottom 5 results, Big SQL was 19X faster

Big SQL V4.1 vs Hive 1.2.1 @ 1TB TPC-DS*

13

*Not an official TPC-DS Benchmark.

Page 15: Hadoop as Storage for Aging Data

14

Big SQL runs more SQL out-of-the-box

Porting effort: Big SQL 4.1 took 1 hour; Spark SQL 1.5.0 took 3-4 weeks. Big SQL can execute all 99 queries with minimal porting effort.

Single-stream results:

Big SQL was faster than Spark SQL on 76 of 99 queries

When Big SQL was slower, it was only slower by 1.6X on average

Query vs. query, Big SQL is on average 5.5X faster

Removing the top 5 / bottom 5, Big SQL is 2.5X faster

Page 16: Hadoop as Storage for Aging Data

• Leverage IBM’s rich SQL history and expertise

Modern SQL:2011 capabilities

DB2-compatible SQL PL support

• SQL bodied functions

• SQL bodied stored procedures

• Robust error handling

• Application logic/security encapsulation

• Flow of control logic

• Same SQL syntax used in the EDW works on Big SQL.

• Grouping sets with CUBE and ROLLUP

• Windowing and OLAP aggregates

Big SQL - SQL Capabilities

15
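As a sketch of what these capabilities look like in practice (table and column names are illustrative, borrowed from the partitioned demo.sales example later in the deck):

```sql
-- Grouping sets: one pass produces per-state/part subtotals,
-- per-state totals, and a grand total.
SELECT state, part_name, SUM(qty * cost) AS revenue
FROM demo.sales
GROUP BY ROLLUP (state, part_name);

-- OLAP windowing: rank parts by cost within each state.
SELECT state, part_name, cost,
       RANK() OVER (PARTITION BY state ORDER BY cost DESC) AS cost_rank
FROM demo.sales;
```

Because the syntax is the same as in the EDW, statements like these can move to the archive unchanged.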


Page 18: Hadoop as Storage for Aging Data

Hadoop-DS: Performance Test update: Big SQL V4.1 vs. Spark SQL 1.5.1 @ 1 TB, single stream*

17

*Not an official TPC-DS Benchmark.

Page 19: Hadoop as Storage for Aging Data

But, ... what happens when you scale it?

1 TB, single stream:
• Big SQL was faster on 76 of 99 queries
• Big SQL averaged 5.5X faster
• Removing the top / bottom 5, Big SQL averaged 2.5X faster

1 TB, 4 concurrent streams:
• Spark SQL FAILED on 3 queries
• Big SQL was 4.4X faster*

10 TB, single stream:
• Big SQL was faster on 80 of 99 queries
• Spark SQL FAILED on 7 queries
• Big SQL averaged 6.2X faster*
• Removing the top / bottom 5, Big SQL averaged 4.6X faster

10 TB, 4 concurrent streams:
• Big SQL's elapsed time for the workload scaled better than linearly
• Spark SQL could not complete the workload (numerous issues); partial results were possible with only 2 concurrent streams

More data, more users.

*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)

Page 20: Hadoop as Storage for Aging Data

What is the verdict? Use the right tool for the right job.

Big SQL: ideal tool for BI data analysts and production workloads
• Migrating existing workloads to Hadoop
• Security
• Many concurrent users
• Best-in-class performance

Spark SQL: ideal tool for data scientists and discovery
• Machine learning
• Simpler SQL
• Good performance

Page 21: Hadoop as Storage for Aging Data

• Modern security model

GRANT/REVOKE support

GROUP permissions

ROW level permissions

COLUMN level permissions

• Data Auditing tools

e.g., IBM InfoSphere Guardium

Big SQL enterprise grade features

20
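A hedged sketch of what this model looks like in DB2-style SQL (the group, user, and permission names are hypothetical; the row-permission statements follow DB2's row and column access control syntax, which is the model the slide refers to):

```sql
-- Table-level privileges for a group and a user
GRANT SELECT ON TABLE demo.sales TO GROUP analysts;
REVOKE INSERT ON TABLE demo.sales FROM USER intern1;

-- Row-level permission: managers see everything,
-- everyone else sees only NJ rows (illustrative rule)
CREATE PERMISSION nj_only ON demo.sales
  FOR ROWS WHERE state = 'NJ'
        OR VERIFY_GROUP_FOR_USER(SESSION_USER, 'MANAGERS') = 1
  ENFORCED FOR ALL ACCESS
  ENABLE;
ALTER TABLE demo.sales ACTIVATE ROW ACCESS CONTROL;
```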

Page 22: Hadoop as Storage for Aging Data

Big SQL Architecture

21

[Architecture diagram:
• Management node: Big SQL master node, together with the database service, Hive metastore, Hive server, and DDL and UDF fenced-mode processes
• Management node: Big SQL scheduler
• Compute nodes (repeated across the cluster): a Big SQL worker node with Java I/O, native I/O, and UDF fenced-mode processes; an HDFS data node; an MR task tracker; other services; HDFS data and temp data on local disks]

*FMP = fenced-mode process

Page 23: Hadoop as Storage for Aging Data

• Head node HA Support

• Easy to extend

Add data nodes for more storage

Add worker nodes for more processing power

• Fault tolerance

Only use online workers for computations

MPP – Flexible architecture

22

Page 24: Hadoop as Storage for Aging Data

Big SQL – data

23

CREATE HADOOP TABLE bigsql.sales (
  part_id INTEGER,
  part_name VARCHAR(32),
  quantity INTEGER,
  cost DECIMAL(10,4)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/bigsql/sales';

Sample file contents:

12321,apple,10,10.20
32881,banana,5,11.91
43321,salt,1,0.90
44552,sugar,1,0.80

The table does not actually "store" the data; it interprets the files in place.
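Because the table is only an interpretation of the files, a query reads the delimited data in place; for the sample rows above, a filter like this would return the apple and banana rows (the two with quantity above 1):

```sql
SELECT part_name, quantity, quantity * cost AS total_cost
FROM bigsql.sales
WHERE quantity > 1;
-- apple   10  102.00
-- banana   5   59.55
```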

Page 25: Hadoop as Storage for Aging Data

• Support for all major file formats

ORC

Parquet

Avro

Sequence

etc.

• Does not lock users in

Files can be used with external tools (Spark, MR, etc)

Big SQL – open storage

24

Page 26: Hadoop as Storage for Aging Data

Part III – Tools to offload data to Online Archive

Page 27: Hadoop as Storage for Aging Data

• Moves data from EDWs and external stores into HDFS

• Bulk import

Import individual tables or entire databases

• SQL import

Import result of SQL execution

• Data interaction

Support for interacting with data through Java classes

• Data export

Offload data from HDFS into relational databases

Simple with BigInsights - Sqoop

26

[Diagram: Sqoop moves data between the EDW and HDFS in the Hadoop cluster]

Page 28: Hadoop as Storage for Aging Data

• SQL statement executable in Big SQL

• Extracts data from external sources

RDBMS

EDW

HDFS

• Loads data into Big SQL tables

Understands Big SQL tables (e.g., columns, partitioning, etc.)

Understands Big SQL table properties (e.g., delimiters, etc.)

Simple with Big SQL - LOAD

27

CREATE HADOOP TABLE staff(
  id INT,
  name VARCHAR(64),
  job VARCHAR(10))
PARTITIONED BY (years INT);

LOAD HADOOP USING
  JDBC CONNECTION URL
    'jdbc:db2://edw:5000/SAMPLE'
  WITH PARAMETERS (
    'credential.file' =
      '/user/bigsql/mycreds'
  )
  FROM TABLE STAFF
  COLUMNS(ID, NAME, JOB, YEARS)
  WHERE 'YEARS is not null'
  WITH TARGET TABLE PROPERTIES(
    'delimited.drop.loadedlims'='true'
  )
  WITH LOAD PROPERTIES (
    'max.rejected.records'=50,
    'num.map.tasks' = 2);

Page 29: Hadoop as Storage for Aging Data

• Transparent

Acts as a single point of access for all data sources

• Extensible

Brings heterogeneous sources together

• High functionality

Use the same rich SQL against all sources

• High performance

Understands the capabilities of each source and pushes computation down

Simple with Big SQL – Federation

28
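Big SQL federation builds on DB2's federation layer; a hedged sketch of the usual DDL flow (the server name, credentials, and remote table are hypothetical, and the remote database is assumed to be already cataloged locally):

```sql
-- Register the remote EDW and map local credentials to it
CREATE SERVER edw TYPE DB2/UDB VERSION '10.5' WRAPPER DRDA
  OPTIONS (DBNAME 'SAMPLE');
CREATE USER MAPPING FOR USER SERVER edw
  OPTIONS (REMOTE_AUTHID 'dbuser', REMOTE_PASSWORD '********');

-- A nickname makes the remote table queryable like a local one
CREATE NICKNAME edw_sales FOR edw.APP.SALES;

-- Join hot EDW data with the cold Hadoop archive in one statement;
-- eligible work is pushed down to the EDW
SELECT e.part_id, e.qty, a.qty AS archived_qty
FROM edw_sales e
JOIN demo.sales a ON a.part_id = e.part_id;
```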

Page 30: Hadoop as Storage for Aging Data

Part IV – Tips for better SQL usability of your Online Archive

Page 31: Hadoop as Storage for Aging Data

• SQL is standard but internals are not

• Some things require more attention

Point queries

Normalization

DML (UPDATE/DELETE)

Tips for better performance

30

Page 32: Hadoop as Storage for Aging Data

• Table partitioning for data elimination

• Special file formats

ORC

Parquet

• HBase with secondary indexes

Point query optimization

31

Page 33: Hadoop as Storage for Aging Data

• Partitioning is defined on one or more columns

• Each unique value becomes a partition

• Query predicates are used to eliminate partitions

Big SQL table partitioning

32

CREATE TABLE demo.sales (
  part_id int,
  part_name string,
  qty int,
  cost double
)
PARTITIONED BY (
  state char(2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

Resulting layout under biginsights/hive/warehouse/demo.db/sales: each partition value gets its own subdirectory (state=NJ, state=AR, state=CA, state=NY), each holding its own data files (data1.csv, data2.csv, ...).

A query with a predicate on the partitioning column only reads the matching directories:

select * from demo.sales
where state in ('NJ', 'CA');
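Loading a partitioned table routes each row to its partition's directory automatically; a sketch assuming a hypothetical staging table with the same columns and dynamic partitioning on insert:

```sql
INSERT INTO demo.sales
SELECT part_id, part_name, qty, cost, state
FROM staging_sales;
-- Rows with state = 'NJ' land under .../sales/state=NJ/, and so on;
-- previously unseen state values create new partition directories.
```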

Page 34: Hadoop as Storage for Aging Data

• What they are:

Columnar file formats

Store statistics

• Pros

Efficient compression

Predicates are pushed down and evaluated at file level!

We skip files for which predicates don’t match

• Cons

Not easy to integrate outside of Hadoop

Big SQL special file formats (ORC, Parquet)

33

CREATE HADOOP TABLE orc_table
  (col1 int, col2 string, col3 bigint)
STORED AS ORC;

CREATE HADOOP TABLE parquet_table
  (col1 int, col2 string, col3 bigint)
STORED AS PARQUETFILE;
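A common use of these formats in an archive is converting an existing delimited table to a columnar one in a single statement; a hedged sketch, assuming CREATE ... AS SELECT is available for Hadoop tables (the target table name is illustrative):

```sql
CREATE HADOOP TABLE sales_parquet
STORED AS PARQUETFILE
AS SELECT * FROM bigsql.sales;
```

Point queries against sales_parquet can then benefit from the per-file statistics described above.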

Page 35: Hadoop as Storage for Aging Data

• Clustered Environment

Master and Region Servers

Automatic split (sharding)

• Key-value store

Key and value are byte arrays

Efficient access using row key

• HBase supports secondary

indexes!

HBase

34
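A hedged sketch of the approximate Big SQL DDL for an HBase-backed table: the COLUMN MAPPING clause ties SQL columns to the HBase row key and to column-family:qualifier pairs (all names here are illustrative):

```sql
CREATE HBASE TABLE archive.orders (
  order_id  INT,
  customer  VARCHAR(64),
  total     DECIMAL(10,2)
)
COLUMN MAPPING (
  KEY            MAPPED BY (order_id),
  cf_data:cust   MAPPED BY (customer),
  cf_data:total  MAPPED BY (total)
);
```

Point lookups on order_id then resolve through the row key rather than a scan.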

Page 36: Hadoop as Storage for Aging Data

[De]normalization – complex types

35

CREATE HADOOP TABLE people(
  id INT,
  name VARCHAR(64),
  phoneNo ARRAY<VARCHAR(32)>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|';

SELECT p.id, p.name, tel.no
FROM people p
UNNEST (p.phoneNo) tel(no);

Result:

ID | NAME              | Number
1  | Eugen Stoianovici | +353 8733 -----
2  | Eugen Stoianovici | +353 8700 -----
3  | John Smith        | +353 8732 -----
...

Page 37: Hadoop as Storage for Aging Data

• Fault tolerance to data (errors in semi-structured data do not cause queries to fail)

• Fault tolerance to hardware

Hardware maintenance can be batched and performed once a failure threshold is reached

Hardware failure does not impact the system

• Elastic clusters

Add more nodes when more capacity is required

Reassign nodes when batch job is done

Big SQL – elastic and fault tolerant

36

Page 38: Hadoop as Storage for Aging Data

Notices and Disclaimers

37

Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form

without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for

accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to

update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO

EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,

LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted

according to the terms and conditions of the agreements under which they are provided.

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as

illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other

results in other operating environments may vary.

References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services

available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the

views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or

other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the

identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the

customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will

ensure that the customer is in compliance with any law.

Page 39: Hadoop as Storage for Aging Data

Notices and Disclaimers (con’t)

38

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly

available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,

compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the

suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to

interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT

LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,

trademarks or other intellectual property right.

• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document

Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM

SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,

OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,

pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,

Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of

International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be

trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:

www.ibm.com/legal/copytrade.shtml.

Page 40: Hadoop as Storage for Aging Data

Questions?

39

Page 41: Hadoop as Storage for Aging Data

• BigInsights – Hadoop open platform integration in the warehouse

• Big SQL architecture and compatibility with enterprise data warehouses

• Data movement tools

• Tips for Big SQL performance

Wrap up

40

Page 42: Hadoop as Storage for Aging Data

41

Get started with Big SQL: External resources

Hadoop Dev: links to videos, white papers, labs, ...

https://developer.ibm.com/hadoop/

Page 43: Hadoop as Storage for Aging Data

We Value Your Feedback!

Don’t forget to submit your Insight session and speaker

feedback! Your feedback is very important to us – we use it

to continually improve the conference.

Access your surveys at insight2015survey.com to quickly

submit your surveys from your smartphone, laptop or

conference kiosk.

42

Page 44: Hadoop as Storage for Aging Data

© 2015 IBM Corporation

Thank You