Hadoop and SQL: Delivering Analytics Across the Organization


© 2015 IBM Corporation

Hadoop and SQL: Delivering Analytics Across the organization (DHS-2147)

Nicholas Berg, Seagate

Adriana Zubiri, IBM

27-Oct-2015 2:30 PM-3:30 PM

Please Note:

• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

A New Seagate

Seagate is in a unique position to create even more value for our customers by integrating our 35+ years of storage expertise in HDD with flash, systems, services and consumer devices, to deliver unique solutions that enable our customers to enjoy and get value from their data more than ever before.

[Graphic: hybrid solutions spanning HDD, flash, silicon, systems and branded products]

• $14B Annual Revenue

• 2 billion drives shipped

• Stores more than 40% of the world’s data

• 43,000 Cloud services clients worldwide

• 50,000 Employees, 26 countries

• 9 Manufacturing plants: US, China, Malaysia, N. Ireland, Singapore, Thailand

• 5 Design centers: US, Singapore, South Korea

• Vertically integrated factories, from silicon fabrication to drive assembly


Where to start with Hadoop - find a use case

• Experimented with text analysis of Call Center logs

Proved out the use case, but Big Data text analytics built into Call Center support applications met the need without in-house costs

• Marketing organization had some social media Big Data Use cases

These are being met by companies specializing in this kind of Big Data analysis

• Reviewed other potential use cases such as:

Mining data center support, performance and maintenance logs

Mining large data sets for IT Security

• Tested loading a volume of factory test log data and running some analytics on it

Compelling use case for Hadoop: Deeper and wider analysis of Factory and Field data

Traditional Data Architecture Pressured

• 4.4 ZB of data in 2013; 44 ZB by 2020 (1 ZB = 1 billion TB)

• 85% from new data types

• 15x machine data by 2020

Seagate’s high-level plans for Hadoop

• Enterprise Hadoop cluster as extension of EDW (augmentation)

Ability to store and analyze 10x-20x Factory and Field data

Much longer retention of relevant manufacturing data

Multi-purpose analytic environment supporting hundreds of potential users across Engineering, Manufacturing and Quality

• Possible local factory Hadoop clusters for special-purpose processing

• Eventual integration across multiple clusters and sites

• At a high level, Hadoop will enable us to

Ask questions we could never ask before…

About data volumes we could never collect and store before…

Do analysis we could never perform in reasonable time…

And connect data that could never before be retained for combined analysis

Hadoop: a dynamic and ever-changing landscape

• When we first started our Hadoop journey, MapReduce was the main way to access and query HDFS data

• Two years on, the Hadoop world has changed, with SQL becoming a major force in Hadoop (Hive, Impala, Big SQL)

• SQL on Hadoop helps address three main Hadoop challenges:

Addresses the skills gap: Hadoop MapReduce needs Java coders, whereas SQL on Hadoop uses existing SQL skills (see the example after this list)

SQL provides integration with existing environments and tools (e.g. databases and BI tools)

Enables Hadoop to move from batch processing to interactive analysis

• New memory-based Apache projects such as Apache Spark allow even faster interactive analysis, but SQL is still core to these too
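To make the skills-gap point concrete, here is a minimal sketch of the kind of query an analyst can write directly in Hive or Big SQL instead of a custom Java MapReduce job. The table and column names are hypothetical, not taken from the presentation:

```sql
-- Hypothetical factory test log table; all names are illustrative only.
-- The same aggregation in hand-written MapReduce would require a Java
-- mapper, reducer and driver class plus a build-and-submit cycle.
SELECT model, test_station, COUNT(*) AS failure_count
FROM factory_test_log
WHERE test_result = 'FAIL'
  AND test_date >= '2015-01-01'
GROUP BY model, test_station
ORDER BY failure_count DESC;
```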

Big data for the enterprise

• We put together a five-year Big Data vision statement and strategy plan

Socialized strategy plan for feedback

• Decided to conduct a large-scale Hadoop pilot

We wanted to really understand Hadoop’s capabilities and potential

• Purchased a 60-node cluster: 3 management nodes, 57 data nodes (now increased to two clusters of 60 + 100 nodes)

• Performed an analysis on which Hadoop distribution to use

• Defined what use cases to run in our large scale pilot

Choosing a Hadoop software distribution

• Two main flavors: open source oriented or more proprietary

• Open source oriented solutions are the most beneficial:

Portable - easily move your Hadoop cluster from vendor to vendor

Avoids vendor lock-in to expensive and proprietary technology

Open source projects ensure interoperability with other open source projects

• Other important considerations:

Integration with RDBMSs, BI solutions and other platforms

R&D investment and support capability

Consulting and training

• Seagate chose IBM because we believe it has the most advanced SQL “add-on” capability for Hadoop, other strong Hadoop tools such as Big R, and excellent support services

Evolving to a Logical Data Warehouse

• A Logical Data Warehouse combines traditional data warehouses with big data systems to evolve your analytics capabilities beyond where you are today

• Hadoop does not replace your EDW. EDW is a good “general purpose” data management solution for integrating and conforming enterprise data to produce your everyday business analytics

• A typical EDW may have hundreds of data feeds and dozens of integrated applications, and may run thousands to hundreds of thousands of queries a day

• Hadoop is more specialized and much less mature. For now it will have only a few application integration points and run fewer queries at a lower concurrency, answering different questions

• A Hadoop cluster of 60-100 nodes is a supercomputer. What would you use a supercomputer for? Probably to answer the really big questions

Some early practices and learnings

• Incremental, phased delivery: use case by use case

• Form a “data lake” or “data reservoir” for all enterprise data

• Data availability must come first; model and transform the data in place within Hadoop and resist moving the data again

• There is a lot of talk about schema on read, but for data warehousing types of use it is impractical

Data modeling is still required, though it can be simplified

• Have multiple clusters: Development, Test, and two or more Production clusters: one for ad hoc data exploration and experimentation, and one for more governed uses with guaranteed cluster availability to run important jobs

• Use existing custom query/analytics solution to provide “transparent” access to Hadoop

Enterprise Hadoop Architecture

The Data Lake: data tiering


Tier 1 / Tier 2 custom data loading application


Data Transport

• Sqoop: Pull EDW data to HDFS Tier 1

• Non-EDW files (Factory push):

• Trickle feed files to staging area

• Unzip, Merge, reZip small files to large files

• Push compacted files to HDFS Tier 1

Data Mapping & Loading (sketched after this slide)

• Match source/target columns

• Detect and handle column changes

• Transform data

• Insert or Update data in Tier 2

• Dual feed to cluster 2

Scheduling

• Oozie backend

• Configurable frequency

• Currently Daily

• Snapshots (waits for data loads to complete)

• Metadata backups

Compaction

• Major and Minor compaction

• Minor: merges small files to large ones

• Major: remove old versions of data (updates)

• Consolidates HDFS directories

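The Tier 1 to Tier 2 flow above can be sketched in HiveQL. This is a minimal illustration under assumed names: the tables, columns, HDFS paths and the Parquet storage format are hypothetical, not taken from Seagate’s application:

```sql
-- Tier 1: external table over the compacted delimited text files
-- pushed from the factories (names and path are illustrative only).
CREATE EXTERNAL TABLE IF NOT EXISTS t1_factory_test_raw (
  drive_sn    STRING,
  test_name   STRING,
  test_result STRING,
  test_ts     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/tier1/factory_test';

-- Tier 2: structured Hive table (Parquet is an assumed format choice).
CREATE TABLE IF NOT EXISTS t2_factory_test (
  drive_sn    STRING,
  test_name   STRING,
  test_result STRING,
  test_ts     TIMESTAMP
)
PARTITIONED BY (load_date STRING)
STORED AS PARQUET;

-- Map source to target columns, apply light transforms, and load one day.
INSERT INTO TABLE t2_factory_test PARTITION (load_date = '2015-10-27')
SELECT drive_sn, test_name, upper(test_result), CAST(test_ts AS TIMESTAMP)
FROM t1_factory_test_raw;
```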

Hadoop cluster data feeding and querying


[Architecture diagram: the EDW and Factory Data Systems feed HDFS; Sqoop pulls EDW data, and a UNIX compact & load process pushes factory files. Data lands in Tier 1 (delimited text files), Tier 2 (Hive tables) and Tier 3 (derived Hadoop tables). Cluster components include MapReduce, Big SQL, Hive, Pig, HCatalog, Big R and a Spark component on YARN, with Ganglia and Nagios for monitoring. Data science applications (SAS, Python, ML) read and write via JDBC, ODBC and other drivers. The diagram also notes 10% sampled drive data vs. 100%.]

Adding Hive update support

• Hive is a good structured table format for querying, but it does not support row updates to large fact tables

This type of capability is known in the database arena as ACID

ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably

Atomicity: requires that each transaction be "all or nothing"; if one part of the transaction fails, the entire transaction fails and the database state is left unchanged

Consistency: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof

Isolation: ensures that the results of concurrently executed transactions are the same as if the transactions had been executed serially

Durability: means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
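For context only, the sketch below shows what row-level updates look like once ACID support exists, using the transactional-table syntax added in later Hive releases (it also requires the Hive transaction manager to be enabled). Table and column names are hypothetical, and this is not the custom approach described on the next slide:

```sql
-- Hive ACID illustration: transactional ORC table (hypothetical names).
CREATE TABLE drive_status (
  drive_sn   STRING,
  status     STRING,
  updated_at TIMESTAMP
)
CLUSTERED BY (drive_sn) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- With ACID, a row-level change is atomic and leaves the table consistent;
-- without it, updating one row means rewriting and reloading whole files.
UPDATE drive_status
SET status = 'RETURNED', updated_at = current_timestamp
WHERE drive_sn = 'Z1X2C3V4';
```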


Custom SerDe (Hive row serializer/deserializer)


Hive read query path (MapReduce jobs for Hive read SQL, reading from HDFS):

UpdInputFormat: splits input files into fragments for individual map tasks; reads index files into memory (helps identify duplicate records); provides the RecordReader factory

UpdRecordReader: reads splits and loads data; discards old versions of rows using info from the index; converts individual records into Writable objects suitable for the Mapper

Hive write query path (MapReduce jobs for Hive write SQL, writing to HDFS):

UpdOutputFormat: provides the RecordWriter factory

UpdRecordWriter: writes each Writable back to HDFS in the user serialization format; writes index files to HDFS (with PKs in each version to help identify duplicate records)
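As a rough sketch of how such custom classes plug into Hive, a table can name them in its DDL. Only the class roles (UpdInputFormat, UpdOutputFormat and a row SerDe) come from this slide; the package names, SerDe class, table definition and location below are hypothetical:

```sql
-- Hypothetical wiring of update-aware custom classes into a Hive table.
CREATE EXTERNAL TABLE t2_updatable_fact (
  drive_sn STRING,
  metric   STRING,
  value    DOUBLE
)
ROW FORMAT SERDE 'com.example.hive.upd.UpdSerDe'        -- hypothetical SerDe class
STORED AS
  INPUTFORMAT  'com.example.hive.upd.UpdInputFormat'    -- reads splits, drops old row versions
  OUTPUTFORMAT 'com.example.hive.upd.UpdOutputFormat'   -- writes data plus index files with PKs
LOCATION '/data/tier2/updatable_fact';
```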

Hadoop challenges (an emerging and evolving platform)

• Knowing which Hadoop projects to “bet on”, which data formats and compression types to use

• Speed of change: probably more code has been written for Hadoop than for any other IT platform

Need to upgrade cluster software frequently (once a quarter)

• Gaps: Some things not ready like ACID, real-time queries

• Resource management for different types of workloads

• Lack of BI tools that can really take advantage of huge data sets and visualize them

• Still very batch-processing oriented, but interactive analysis is gaining traction with Spark etc.

• Provisioning large numbers of machines, hardware failures

• Integrating remote clusters, cross cluster data movement and inter-cluster processing

Hadoop projects – setting expectations

• Completely new and an awful lot to learn, design & implementation are huge tasks

• Hadoop is still quite immature and lacks robustness

Exhibits instability and bugs; new code is released too early

• Speed of change: management needs to understand that plans will be dynamic and will change with the evolving technology

Have less formal schedules, manage expectations to the low side

• Be flexible and adaptable as technology changes and matures

Be ready to change and adapt as new technology emerges, or if support dries up on a Hadoop project

• Developing IT skills quickly

Finding experienced and talented Hadoop staff or consultants

Keeping up with the data scientists

• Convincing security and data center teams to give Hadoop users UNIX level access


About the IBM Open Platform for Apache Hadoop

• IBM Open Platform – foundation of 100% pure open source Apache Hadoop components

• Standardizing as the Open Data Platform (http://opendataplatform.org)

• All standard Apache open source components (ODP): HDFS, YARN, MapReduce, Ambari, HBase, Spark, Flume, Hive, Pig, Sqoop, HCatalog, Solr/Lucene

Big SQL At a Glance

• Data shared with the Hadoop ecosystem

• Comprehensive file format support

• Superior enablement of IBM software

• Enhanced by third party software

• Modern MPP runtime

• Powerful SQL query rewriter

• Cost based optimizer

• Optimized for concurrent user throughput

• Distributed requests to multiple data sources within a single SQL statement (example after this slide)

Main data sources supported: DB2, Teradata, Oracle, Netezza, MS SQL Server, Informix

• Advanced security/auditing

• Resource and workload management

• Self tuning memory management

• Comprehensive monitoring

• Comprehensive SQL support

IBM SQL PL compatibility

Extensive analytic functions
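To illustrate the federation capability, assume a nickname EDW_SALES has already been defined in Big SQL over a remote DB2 table (the setup and all names here are hypothetical). A single statement can then combine Hadoop-resident and RDBMS-resident data:

```sql
-- Joining a local Hadoop table with a federated DB2 nickname in one query.
-- t2_factory_test and EDW_SALES are hypothetical names.
SELECT f.drive_sn, f.test_result, s.order_id, s.ship_date
FROM t2_factory_test f
JOIN EDW_SALES s
  ON s.drive_sn = f.drive_sn
WHERE s.ship_date >= '2015-01-01';
```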

New functionality in Big SQL in 2015

• Ambari Installation and configuration

• HBase support

• Rich Management User Interface

• Data types

New primitive data types support (decimal, char, varbinary)

Complex data types support (array, struct, map)

Enhancements to varchar and date

• Platforms

Power support

RHEL 6.6 and 7.1 support

• More Performance

HDFS caching

UDFs performance improvements

ANALYZE enhancements (2.5X faster than 3.0 FP2)

Native implementations of key Hive built-in functions

SQL Enhancements

New analytic procedures

New OLAP functions and aggregate functionality

Offset support for LIMIT and FETCH FIRST (see the examples after this slide)

Ability to directly define and execute Hive User Defined Functions (UDFs)

Other Improvements

Improved support for concurrent LOAD operations

Support for importing with the Teradata Connector for Hadoop (TDH)

Added SQL Server 2012 and DB2/Z support

CMX compression now supported in the native I/O engine

High Availability (FP2)

Technical Previews

Yarn/Slider integration

Spark integration (FP2)
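A brief sketch of a few of these additions in SQL form. The table, columns and storage clause are hypothetical, and the exact Big SQL DDL for complex types may differ from the Hive-style notation used here:

```sql
-- New primitive and complex type support (illustrative DDL only).
CREATE HADOOP TABLE drive_summary (
  drive_sn    VARCHAR(32),
  unit_cost   DECIMAL(10, 2),    -- new primitive type support
  error_codes ARRAY<INT>         -- new complex type support
)
STORED AS PARQUET;

-- Offset support for LIMIT / FETCH FIRST: page through an ordered result.
SELECT drive_sn, unit_cost
FROM drive_summary
ORDER BY unit_cost DESC
LIMIT 20 OFFSET 40;
```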

Hadoop-DS performance test update: Big SQL V4.1 vs. Spark SQL 1.5.1 @ 1 TB, single stream*

*Not an official TPC-DS Benchmark.


Big SQL runs more SQL out of the box

Porting effort: 1 hour for Big SQL 4.1 vs. 3-4 weeks for Spark SQL 1.5.0

Big SQL can execute all 99 queries with minimal porting effort

Single stream results:

Big SQL was faster than Spark SQL on 76 / 99 queries

When Big SQL was slower, it was only slower by 1.6X on average

Query vs. query, Big SQL is on average 5.5X faster

Removing the top 5 / bottom 5, Big SQL is 2.5X faster

But, … what happens when you scale it?

Scale: 1 TB

Single stream:

• Big SQL was faster on 76 / 99 queries

• Big SQL averaged 5.5X faster

• Removing the top / bottom 5, Big SQL averaged 2.5X faster

4 concurrent streams:

• Spark SQL FAILED on 3 queries

• Big SQL was 4.4X faster*

Scale: 10 TB

Single stream:

• Big SQL was faster on 80 / 99 queries

• Spark SQL FAILED on 7 queries

• Big SQL averaged 6.2X faster*

• Removing the top / bottom 5, Big SQL averaged 4.6X faster

4 concurrent streams:

• Big SQL elapsed time for the workload was better than linear

• Spark SQL could not complete the workload (numerous issues); partial results were possible with only 2 concurrent streams

*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)

[Diagram: scaling along two dimensions, more users and more data]

What is the verdict? Use the right tool for the right job

Big SQL (ideal tool for BI data analysts and production workloads):

• Migrating existing workloads to Hadoop

• Security

• Many concurrent users

• Best-in-class performance

Spark SQL (ideal tool for data scientists and discovery):

• Machine learning

• Simpler SQL

• Good performance

Big SQL Roadmap 2015-2016

2015:

• HBase support

• Rich management user interface

• Complex data types: array, struct, map

• Better performance

• New analytic procedures and OLAP functions

• Offset support for LIMIT and FETCH FIRST

• Ability to directly define and execute Hive UDFs

• Improved support for concurrent LOAD

• Support for importing with the Teradata connector for Hadoop

• New federated sources: SQL Server 2012, DB2/Z, Oracle 12c

• Power platform support

• Head node high availability

1H2016:

• Yarn/Slider support

• Performance improvements at large scale

• Resiliency improvements

• Spark integration/exploitation

• Faster statistics collection

• Cumulative statistics

• Sampling statistics

• HBase update/delete

• Hive update/delete

• User defined aggregates

• Oracle compatibility improvements

• Netezza compatibility improvements

• Integration with Ranger

• BLU technology exploitation

• Self-collecting statistics

• zLinux platform support

We Value Your Feedback!

Don’t forget to submit your Insight session and speaker feedback! Your feedback is very important to us – we use it to continually improve the conference.

Access the Insight Conference Connect tool at insight2015survey.com to quickly submit your surveys from your smartphone, laptop or conference kiosk.