Big SQL 3.0 - Fast and easy SQL on Hadoop

© 2014 IBM Corporation

z/OS and LUW

Big SQL 3.0 Fast and easy SQL on Hadoop

Wilfried Hoge, IT Architect Big Data, @wilfriedhoge


Hadoop Observations

Technology
•  Rapid innovation
•  Two sources of innovation
   – Open source community
   – Integration of existing technologies
•  Tools and application vendors selecting partners and integrating

Customers
•  High degree of interest
•  Many experimental workstreams
•  ROI establishment varies by use case
•  Many customers want to offload data from EDW

Vendors
•  Multiple business models
•  OSS support vendors have mindshare lead
•  OSS support vendors business model viability unclear
•  SW Portfolio vendors integrating/adding


InfoSphere BigInsights provides Enterprise Grade Hadoop analytics

•  Manages a wide variety and huge volume of data

•  Augments open source Hadoop with enterprise capabilities

   – Visualization & Exploration
   – Development tools
   – Advanced Engines
   – Connectors
   – Workload Optimization
   – Enterprise integration
   – Analytic Accelerators
   – Application and industry accelerators
   – Administration & Security

Figure: Big Data Platform. Blocks shown: Accelerators; Information Integration & Governance; Data Warehouse, Stream Computing, Hadoop System; Discovery, Application Development, Systems Management. Data types: Data, Media, Content, Machine, Social.



Key Differentiators for BigInsights

Enterprise Performance & Integration
•  Workload / performance optimization
•  GPFS
•  Security
•  Key integrations & Connectors with Enterprise Ecosystem

Analytics
•  Text analytics
•  Social Data Analytics Accelerators
•  Machine Data Analytics Accelerators
•  Execute R in an integrated application

Usability & Productivity
•  Big SQL
•  BigSheets
•  Development Tools
•  Web Console


Integrated Web Console

•  Manage BigInsights
   – Inspect / monitor system health
   – Add / drop nodes
   – Start / stop services
   – Run / monitor jobs (applications)
   – Explore / modify file system
   – Create custom dashboards
•  Launch applications
   – Spreadsheet-like analysis tool
   – Pre-built applications (IBM supplied or user developed)
•  Publish applications
•  Monitor cluster, applications, data
   – Create / view event alerts



Distributed filesystem GPFS FPO gives additional flexibility, security and high availability

•  Optional file system alternative to HDFS
•  More than 10 years of experience with HPC
•  Key features
   – No single point of failure
   – Built-in High Availability
   – POSIX compliance
      •  Standard applications cannot use HDFS but they can use GPFS-FPO
   – Enhanced Security
   – Higher performance
      •  Allows concurrent read and write by multiple programs
   – Recovery capabilities
      •  Journaling filesystem
   – Support for Storage Pools
   – SnapShot capability

Figure: software stack, top to bottom: Applications; High level languages (SQL, JAQL, PIG, …); Map/Reduce API; Hadoop DFS API; Distributed Filesystem (GPFS or HDFS).


BigInsights has a simple but effective security system based on a gateway to Hadoop

•  All Hadoop servers are connected over a private network

•  Unrestricted communication between cluster servers on the private network

•  BigInsights Web Console acts as a gateway into the cluster

•  Authentication through PAM or LDAP
•  Role based authorization
•  Authorization will be enforced at 3 levels:
   – UI level
   – Data level
   – Map-Reduce level
•  Authorization also respected by services (e.g. SQL)
•  Kerberos support

Figure: security architecture. Components shown: Users, External Sources, Authentication Authority, Gateway / Web Console, Services, Data Nodes, Infrastructure Nodes, Distributed Filesystem.


BigSheets to analyze and visualize

•  Model “big data” collected from various sources in spreadsheet-like structures
•  Filter and enrich content with built-in functions
•  Combine data in different workbooks
•  Visualize results through spreadsheets, charts
•  Export data into common formats (if desired)

No programming knowledge needed!



Centralized dashboard & data flows

•  A centralized dashboard to visualize analytic results:
   – BigSheets collections
   – Analytic application results
   – Monitoring metrics
•  Ability to view BigSheets data flows between and across data sets to quickly navigate and relate analysis and charts
•  Visualize inner and outer joins, enhanced filters for BigSheets columns, column data-type mapping for collections, and application of analytics to BigSheets columns



Tools for Developers

Editors
•  A workflow editor that greatly simplifies the creation of complex Oozie workflows with a consumable interface
•  A Pig/Jaql editor with content assist and syntax highlighting that enables users to create and execute new applications using Pig or Jaql in local or cluster mode from the Eclipse IDE

Application development & deployment
•  Enablement of BigSheets macro and BigSheets reader development
•  Text Analytics development, including support for modular rule sets
•  Publish new application: BigSheets Macro, BigSheets Reader, AQL module, Jaql module

1. Sample your data
2. Develop your application using BigInsights tools
3. Test your application
4. Package and publish your application
5. Deploy your application on the cluster


Running Applications on Big Data

•  Browse available applications
•  Deploy published applications (administrators only)
•  Launch (or schedule for launch) a deployed application
•  Monitor job (application) execution status

•  Predefined applications
   – Import & Export Data
      •  Database & Files
      •  Web and Social
   – Analyze and Query
      •  Predictive Analytics
      •  Text Analytics
      •  SQL/Hive, Jaql, Pig, HBase
   – Accelerators


Application linking and interfaces to build new apps

•  Compose new applications from existing applications and BigSheets
•  Invoke analytics applications from the web console, including integration within BigSheets
•  REST data source App that enables users to load data from any data source supporting REST APIs into BigInsights, including popular social media services
•  Sampling App that enables users to sample data for analysis
•  Subsetting App that enables users to subset data for data analysis



Collaborative Big Data for many roles

•  Business users can get their hands on big data and use big data applications and BigSheets to get insights into their data
•  Data scientists can perform deeper analysis and get richer insights
•  Administrators are empowered to be more agile through better controls and views into key performance indicators
•  Developers can leverage unified tooling in a Big Data Application Development Lifecycle and are able to create and deploy new types of applications, with enhancements that simplify even complex workflows


Big SQL 3.0 – Architected for Performance

•  Leverage IBM's rich SQL heritage, expertise, and technology
   – Modern SQL:2011 capabilities
   – DB2 compatible SQL PL support
      •  SQL bodied functions and stored procedures
      •  Application logic/security encapsulation
•  Architected from the ground up for performance
   – Low latency and high throughput
•  MapReduce replaced with a modern MPP architecture
   – Compiler and runtime are native code (not Java)
   – Big SQL worker daemons live directly on cluster
   – Continuously running (no startup latency)
   – Processing happens locally at the data
•  Operations occur in memory with the ability to spill to disk
   – Supports aggregations and sorts larger than available RAM
•  Integration with BigSheets (source & target)
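The "spill to disk" bullet above can be illustrated with a generic external merge sort. This is only a sketch of the general technique, not Big SQL's native implementation; the function name `external_sort` and the `max_in_memory` parameter are hypothetical.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory=1000):
    """Yield integers in sorted order while buffering at most max_in_memory of them.

    Hypothetical helper for illustration only; not Big SQL's implementation.
    """
    run_paths = []
    buffer = []

    def spill(chunk):
        # Write one sorted run to a temporary file, one value per line.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{v}\n" for v in sorted(chunk))
        run_paths.append(path)

    def read_run(path):
        # Stream a spilled run back from disk, one value at a time.
        with open(path) as handle:
            for line in handle:
                yield int(line)

    for value in values:
        buffer.append(value)
        if len(buffer) >= max_in_memory:
            spill(buffer)
            buffer = []
    if buffer:
        spill(buffer)

    try:
        # heapq.merge keeps only one pending item per run in memory.
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for path in run_paths:
            os.remove(path)
```

Calling `list(external_sort(data, max_in_memory=512))` returns the same result as `sorted(data)`, but the working set in memory stays bounded by the buffer size plus one item per spilled run.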

Figure: a SQL-based application uses the IBM Data Server Client to reach the Big SQL MPP runtime in InfoSphere BigInsights. Data source formats shown: Parquet, CSV, Seq, RC, Avro, ORC, JSON, Custom.


Big SQL 3.0 – Architecture cont.

•  Head (coordinator / management) node
   – Listens to the JDBC/ODBC connections and compiles / optimizes the query
   – Coordinates the execution of the query
   – Optionally stores user data in traditional RDBMS table (single node only)
•  Big SQL worker processes reside on compute nodes (some or all)
•  Worker nodes stream data between each other as needed
•  Workers can spill large data sets to local disk if needed
   – Allows Big SQL to work with data sets larger than available memory
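The division of labour between the head node and the workers can be sketched as a scatter-gather aggregation, roughly what a `SELECT region, COUNT(*) ... GROUP BY region` would need. This is a toy model under stated assumptions: threads stand in for worker daemons, in-memory lists stand in for node-local data, and the function names are hypothetical rather than Big SQL APIs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def worker_partial_count(rows):
    """Worker side: aggregate only the rows that live on this node."""
    return Counter(region for region, _amount in rows)


def coordinator_count(partitions):
    """Head-node side: fan the work out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(worker_partial_count, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical data: each inner list stands for the rows local to one compute node.
partitions = [
    [("EMEA", 10), ("APAC", 5)],
    [("EMEA", 7)],
    [("AMER", 3), ("APAC", 1)],
]
print(coordinator_count(partitions))
```

Only the small per-node counters travel to the coordinator, which is the point of processing "locally at the data".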

Figure: cluster layout. Management nodes host the Big SQL head, Hive Metastore, Name Node and Job Tracker. Each compute node hosts a Task Tracker, a Data Node and a Big SQL worker. All nodes sit on GPFS/HDFS.


Big SQL 3.0 – Features

Application Portability & Integration
•  Data shared with Hadoop ecosystem
•  Comprehensive file format support
•  Superior enablement of IBM software
•  Enhanced by Third Party software

Performance
•  Modern MPP runtime
•  Powerful SQL query rewriter
•  Cost based optimizer
•  Optimized for concurrent user throughput
•  Results not constrained by memory

Federation
•  Distributed requests to multiple data sources within a single SQL statement
•  Main data sources supported: DB2 LUW, DB2/z, Teradata, Oracle, Netezza

Enterprise Features
•  Advanced security/auditing
•  Resource and workload management
•  Self tuning memory management
•  Comprehensive monitoring

Rich SQL
•  Comprehensive SQL Support
•  IBM SQL PL compatibility
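The idea of federation, one SQL statement spanning several data sources, can be tried out in miniature with Python's built-in `sqlite3` module. This is an analogy only: two separate SQLite databases stand in for a Hadoop table and a remote RDBMS, and all table and column names are made up. Big SQL's own federation setup is not shown here.

```python
import sqlite3

# Two independent in-memory databases stand in for two different data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

# Hypothetical tables: click data in one source, customer master data in the other.
conn.execute("CREATE TABLE main.clicks (customer_id INTEGER, url TEXT)")
conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.clicks VALUES (?, ?)",
    [(1, "/home"), (1, "/cart"), (2, "/home")],
)
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# A single statement joins across both sources.
rows = conn.execute(
    """
    SELECT c.name, COUNT(*) AS clicks
    FROM main.clicks AS k
    JOIN warehouse.customers AS c ON c.customer_id = k.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

The application sees one connection and one result set, and the engine decides how to reach each source.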


BigSQL Demo


Comparing Big SQL 3.0 and Hive 0.12 for Ad-Hoc Queries

Figure: BigSQL 3.0 (Parquet) vs Hive 0.12 (ORC), 1TB Classic BI Workload. Elapsed time in seconds (axis 0 to 3500) per query number (1 to 22), plotted for Hive 0.12 and BigSQL 3.0.

Big SQL is up to 41x faster than Hive 0.12*

*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.


IBM BigInsights brings efficient integration of R with Big R

•  R as a big data query language
   – Outside-in execution
•  R as a statistical language for deep computing
   – Inside-out execution
   – Partitioning of large data (“divide”)
   – Parallel cluster execution of pushed down R code (“conquer”)
   – Almost any R package can run in this environment
•  R as the gateway to scalable machine learning
   – A scalable ML engine that provides canned algorithms, and an ability to author new ones, all via R

Figure: R clients either pull data (summaries) to the R client, or push R functions right on the data. Components shown: R Clients, Scalable ML Engine, Embedded R Execution, R Packages, Data Sources.
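The "divide" and "conquer" steps described for Big R follow the familiar split-apply-combine pattern. The sketch below shows that pattern in Python rather than R, with threads standing in for cluster nodes; `divide`, `conquer` and the sample records are hypothetical and are not Big R functions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def divide(rows, key):
    """Partition rows by a key column ("divide")."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def conquer(groups, func):
    """Apply a user-supplied function to every partition in parallel ("conquer")."""
    keys = list(groups)
    with ThreadPoolExecutor() as pool:
        return dict(zip(keys, pool.map(func, (groups[k] for k in keys))))


# Hypothetical sample records.
flights = [
    {"carrier": "AA", "delay": 10},
    {"carrier": "AA", "delay": 20},
    {"carrier": "LH", "delay": 5},
]
avg_delay = conquer(
    divide(flights, "carrier"),
    lambda part: mean(r["delay"] for r in part),
)
print(avg_delay)
```

The user function only ever sees one partition, which is what lets the same code run unchanged when partitions are spread over many machines.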


Text Analytics in BigInsights

•  Distill structured information from unstructured data
   – Rich annotator library supports multiple languages
   – Declarative Information Extraction (IE) system based on an algebraic framework
   – Richer, cleaner rule semantics
   – Better performance through optimization

How it works
•  Parses text and detects meaning with annotators
•  Understands the context in which the text is analyzed
•  Hundreds of pre-built annotators for names, addresses, phone numbers, among others

Accuracy
•  Highly accurate in deriving meaning from complex text

Performance
•  AQL language optimized for MapReduce

Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.

Unstructured text (document, email, etc)

Classification and Insight
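To make the annotator idea concrete, here is a small rule-based extractor applied to part of the sample text above. It is written in Python with regular expressions and is not AQL. The team and role dictionaries, the "two capitalized words" person rule, and the nearest-match pairing are simplifying assumptions chosen for this one example.

```python
import re

# Dictionary-style annotators (assumed vocabularies for this example only).
TEAM = re.compile(r"\b(Netherlands|Spain)\b")
ROLE = re.compile(r"\b(striker|keeper|winger)\b", re.IGNORECASE)
# Person rule: two capitalized words that do not start with a role word.
PERSON = re.compile(r"\b(?!(?:Striker|Keeper|Winger)\b)[A-Z][a-z]+ [A-Z][a-z]+\b")

SAMPLE = (
    "Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, "
    "but the keeper for Spain, Iker Casillas made the save. "
    "Winger Andres Iniesta scored for Spain for the win."
)


def spans(pattern, text):
    """Return (start offset, matched text) for every match of one annotator."""
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


def nearest(position, candidates):
    """Pick the candidate span whose start is closest to the given offset."""
    return min(candidates, key=lambda c: abs(c[0] - position))[1]


def extract_players(text):
    """Combine the three annotators into one structured record per person."""
    teams, roles = spans(TEAM, text), spans(ROLE, text)
    return [
        {"person": name, "role": nearest(pos, roles).lower(), "team": nearest(pos, teams)}
        for pos, name in spans(PERSON, text)
    ]


for record in extract_players(SAMPLE):
    print(record)
```

Each low-level annotator marks spans of one type, and a higher-level rule combines them into records. A declarative system would express the same combination as rules that an optimizer can reorder, instead of hand-written loops.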


BigInsights offers value beyond Open Source

Figure: BigInsights layers. Enterprise Capabilities (Administration & Security, Workload Optimization, Connectors, Advanced Engines, Visualization & Exploration, Development Tools) on top of open source components (IBM-certified Apache Hadoop).

Key differentiators
•  Built-in analytics
•  Enterprise software integration
•  Spreadsheet-style analysis
•  Integrated installation of supported open source and other components
•  Web Console for admin and application access
•  Platform enrichment: additional security, performance features, . . .
•  World-class support
•  Full open source compatibility

Business benefits
•  Quicker time-to-value due to IBM technology and support
•  Reduced operational risk
•  Enhanced business knowledge with flexible analytical platform
•  Leverages and complements existing software


InfoSphere BigInsights for Hadoop includes the latest Open Source components, enhanced by enterprise components

Figure: IBM InfoSphere BigInsights for Hadoop component map (legend: IBM, Open Source; * In Beta).

Categories: Runtime; File System; Data Store; Resource Management & Administration; Security; Data Access; Advanced Analytics; Visualization & Ad Hoc Analytics; Applications & Development; Governance; Stream Computing; Search

Components: MapReduce; Adaptive MapReduce; YARN*; HDFS; GPFS FPO; HBase; ZooKeeper; Oozie; Flexible Scheduler; Console; Monitoring; Kerberos; LDAP; Audit & History; Data Security for Hadoop; Data Privacy for Hadoop; Data Matching; Data Masking; Jaql; Pig; Hive; Big SQL; HCatalog; Flume; Sqoop; ETL; Text Analytics; R; Big R; BigSheets; Dashboard; Charting; Eclipse Tooling (MapReduce, Hive, Jaql, Pig, Big SQL, AQL); BigSheets Reader and Macro; Text Analytics Extractors; Streams; Enterprise Search; Solr/Lucene


From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs

Chart axes: Breadth of capabilities; Enterprise class. Base: Apache Hadoop.

Quick Start
•  Free, non-production
•  Same features as Standard Edition plus text analytics and Big R

Standard Edition
•  Spreadsheet-style tool
•  Web console
•  Dashboards
•  Pre-built applications
•  Eclipse tooling
•  RDBMS connectivity
•  Big SQL
•  Monitoring and alerts
•  Platform enhancements
•  . . .

Enterprise Edition
•  Accelerators
•  GPFS-FPO
•  Adaptive MapReduce
•  Text analytics
•  Enterprise Integration
•  Big R
•  InfoSphere Streams*
•  Watson Explorer*
•  Cognos BI*
•  Data Click*
•  . . .

* Limited use license
