Using your DB2 SQL skills with Hadoop and Spark
TRANSCRIPT
© 2016 IBM Corporation
Using your DB2 skills with Hadoop and Spark
Presented to TRIDEX DB2 Users Group, June 2017
C. M. Saracco, IBM Silicon Valley Lab
https://www.slideshare.net/CynthiaSaracco/presentations
Executive summary
§ About Apache Hadoop and Spark
− Popular open source technologies for working with Big Data
• Clustered computing > scalability
• Varied data > no pre-set structure or schema requirements
− Hadoop: distributed file system (storage), MapReduce API, . . .
− Spark: in-memory data processing (speed), built-in libraries, . . .
§ About Big SQL
− DB2-compatible query engine for Hadoop data (IBM or Hortonworks distributions)
− Based on decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization; strong runtime performance for analytical workloads
§ Some ways to leverage DB2 SQL skills
− Create / manage / query “local” or distributed tables in Hadoop
− Query / join Hadoop data with DB2, Oracle, Teradata, etc. data via query federation
− Leverage Spark to query and manipulate Big SQL or DB2 data
− Leverage Big SQL to initiate Spark jobs and analyze results
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
Information is at the center of a new wave of opportunity… and organizations need deeper insights
§ 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
§ 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
§ 83% of CIOs cited “business intelligence and analytics” as part of their visionary plans to enhance competitiveness
§ 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
§ 4 million “likes” per minute
§ 300,000 tweets per minute
§ 150 million emails per minute
§ 2.78 million video views per minute
§ 2.5 TB per day per A350 plane
§ > 1 PB per day from gas turbines
(1 ZB = 1 billion TB)
Big Data adoption (study results)

Stage                                                                      2012 to 2014   2015   Change
Educate: learning about big data capabilities                              24%-26%        10%    250% decrease
Explore: exploring internal use cases and developing a strategy            43%-47%        53%    125% increase
Engage: implementing infrastructure and running pilot activities           22%-27%        25%    0% change
Execute: using big data and analytics pervasively across the enterprise    5%-6%          13%    210% increase

2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
Big Data ROI often < 18 months
§ Return on investment period for big data and analytics projects, as reported by respondents
2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
About Hadoop and Spark
§ Both open source Apache projects
− Exploit distributed computing environments
− Enable processing of large volumes of varied data
§ Hadoop
− Inspired by Google technologies (MapReduce, GFS)
− Originally designed for batch-oriented, read-intensive applications
− “Core” consists of distributed file system, MapReduce, job scheduler, utilities
− Complementary projects span data warehousing, workflow management, columnar data storage, activity monitoring, . . .
§ Spark
− Began as a UC Berkeley project
− Fast, general-purpose engine for working with Big Data in memory
− Popular built-in libraries for machine learning, streaming data, query (SQL), . . .
− No built-in storage; interfaces to Hadoop, other stores
IBM contributions: Hadoop and Spark
Snapshots taken Jan. 2017. Latest content available online via Apache dashboards.
IOP relates to Hadoop; STC relates to Spark.
What is Big SQL?
[Architecture at a glance: SQL-based applications connect through the IBM data server client to the Big SQL engine (SQL MPP run-time) running on IBM Open Platform or Hortonworks Data Platform, with data stored in HDFS.]
§ Comprehensive, standard SQL for Hadoop
– SELECT: joins, unions, aggregates, subqueries . . .
– UPDATE/DELETE (HBase-managed tables)
– GRANT/REVOKE, INSERT … INTO
– SQL procedural logic (SQL PL)
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
§ Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuously running daemons (no start-up latency)
– Message passing allows data to flow between nodes without persisting intermediate results
– In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
§ Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
§ Integration with RDBMSs via LOAD, query federation
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
About Hadoop and Big SQL
§ Big SQL
− Easy on-ramp to Hadoop for DB2 SQL professionals
− Create a query-ready data lake
− Offload “cold” RDBMS warehouse data to Hadoop
− . . . .
§ Some ways to use Big SQL . . .
− Create tables
− Load / insert data
− Execute complex queries
− Exploit various DB2 features: UDFs, EXPLAIN, workload management, Oracle / Netezza SQL compatibility, . . .
− Exploit various Hadoop features: Hive, HBase, SerDes, . . .
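As one illustration of the DB2 feature carry-over, a hedged sketch of capturing an access plan with the familiar EXPLAIN facility (the users table appears in a later chart; the database name and output file here are hypothetical):

```sql
-- Capture the access plan for a query against a Big SQL table
-- (assumes the EXPLAIN tables have already been created, as in DB2)
EXPLAIN PLAN FOR
  SELECT fname, lname
  FROM users
  WHERE id > 100;

-- The plan is written to the EXPLAIN tables; format it with the
-- db2exfmt utility, e.g.:  db2exfmt -d bigsql -1 -o plan.txt
```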
Invocation options
§ Command-line interface: Java SQL Shell (JSqsh)
§ Web tooling (Data Server Manager)
§ Tools that support IBM JDBC/ODBC driver
Creating a Big SQL table
§ Standard CREATE TABLE DDL with extensions

create hadoop table users
( id        int not null primary key,
  office_id int null,
  fname     varchar(30) not null,
  lname     varchar(30) not null )
row format delimited
fields terminated by '|'
stored as textfile;

Worth noting:
• The “hadoop” keyword creates the table in HDFS
• Delimited row format and textfile are the defaults
• Constraints are not enforced (but useful for query optimization)
• Examples in these charts focus on HDFS storage, within or external to the Hive warehouse; HBase examples provided separately
CREATE VIEW
§ Standard SQL syntax
create view my_users as
select fname, lname from biadmin.users where id > 100;
Populating tables via LOAD
§ Typically best runtime performance
§ Load data from local or remote file system

load hadoop using file url
'sftp://myID:[email protected]:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with source properties ('field.delimiter'='\t')
into table gosalesdw.GO_REGION_DIM overwrite;

§ Load data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC connection

load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
Populating tables via INSERT
§ INSERT INTO . . . SELECT FROM . . .
− Parallel read and write operations

CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key     INT NOT NULL,
  product_name    VARCHAR(150),
  quantity        INT,
  order_method_en VARCHAR(90) )
STORED AS parquetfile;

-- source tables do not need to be in Parquet format
INSERT INTO big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales,
     sls_product_dim prod,
     sls_product_lookup pnumb,
     sls_order_method_dim meth
WHERE pnumb.product_language = 'EN'
  AND sales.product_key = prod.product_key
  AND prod.product_number = pnumb.product_number
  AND meth.order_method_key = sales.order_method_key
  AND sales.quantity > 5500;

§ INSERT INTO . . . VALUES( . . . )
− Not parallelized; 1 file per INSERT. Not recommended except for quick tests

CREATE HADOOP TABLE foo (col1 int, col2 varchar(10));
INSERT INTO foo VALUES (1, 'hello');
CREATE . . . TABLE . . . AS SELECT . . .
§ Create a Big SQL table based on contents of other table(s)
§ Source tables can be in different file formats or use different underlying storage mechanisms

-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key       INT NOT NULL,
  product_line_code INT NOT NULL,
  product_type_key  INT NOT NULL,
  product_type_code INT NOT NULL,
  product_line_en   VARCHAR(90),
  product_line_de   VARCHAR(90) )
AS SELECT product_key, d.product_line_code, product_type_key,
          product_type_code, product_line_en, product_line_de
   FROM extern.sls_product_dim d, extern.sls_product_line_lookup l
   WHERE d.product_line_code = l.product_line_code;
SQL capability highlights
§ Query operations
− Projections, restrictions
− UNION, INTERSECT, EXCEPT
− Wide range of built-in functions (e.g. OLAP)
− Various Oracle, Netezza compatibility items
§ Full support for subqueries
− In SELECT, FROM, WHERE and HAVING clauses
− Correlated and uncorrelated
− Equality, non-equality subqueries
− EXISTS, NOT EXISTS, IN, ANY, SOME, etc.
§ All standard join operations
− Standard and ANSI join syntax
− Inner, outer, and full outer joins
− Equality, non-equality, cross join support
− Multi-value join
§ Stored procedures, user-defined functions, user-defined aggregates

Sample query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
Power of standard SQL
§ Big SQL executes all 22 TPC-H queries without modification
§ Big SQL executes all 99 TPC-DS queries without modification
§ Big SQL leverages DB2 query rewrite technology for efficient optimization

Original query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name

Re-written query:

SELECT s_name, count(1) AS numwait
FROM
 (SELECT s_name
  FROM
   (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey
    FROM
     (SELECT l_orderkey,
             count(distinct l_suppkey) as count_suppkey,
             max(l_suppkey) as max_suppkey
      FROM lineitem
      WHERE l_receiptdate > l_commitdate
      GROUP BY l_orderkey) t2
    RIGHT OUTER JOIN
     (SELECT s_name, l_orderkey, l_suppkey
      FROM
       (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey
        FROM
         (SELECT l_orderkey,
                 count(distinct l_suppkey) as count_suppkey,
                 max(l_suppkey) as max_suppkey
          FROM lineitem
          GROUP BY l_orderkey) t1
        JOIN
         (SELECT s_name, l_orderkey, l_suppkey
          FROM orders o
          JOIN
           (SELECT s_name, l_orderkey, l_suppkey
            FROM nation n
            JOIN supplier s
              ON s.s_nationkey = n.n_nationkey
             AND n.n_name = 'INDONESIA'
            JOIN lineitem l
              ON s.s_suppkey = l.l_suppkey
            WHERE l.l_receiptdate > l.l_commitdate) l1
          ON o.o_orderkey = l1.l_orderkey
         AND o.o_orderstatus = 'F') l2
        ON l2.l_orderkey = t1.l_orderkey) a
      WHERE (count_suppkey > 1)
         OR ((count_suppkey = 1) AND (l_suppkey <> max_suppkey))) l3
    ON l3.l_orderkey = t2.l_orderkey) b
  WHERE (count_suppkey IS NULL)
     OR ((count_suppkey = 1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
Query federation = virtualized data access
§ Transparent
− Appears to be one source
− Programmers don’t need to know how / where data is stored
§ Heterogeneous
− Accesses data from diverse sources
§ High function
− Full query support against all data
− Capabilities of sources as well
§ Autonomous
− Non-disruptive to data sources, existing applications, systems
§ High performance
− Optimization of distributed queries
[Diagram: SQL tools and applications access virtualized data spanning multiple data sources.]
Federation in practice
§ Admin enables federation
§ Apps connect to the Big SQL database
§ Nicknames look like tables to the app
§ Big SQL optimizer creates a global data access plan with cost analysis, query push-down
§ Query fragments executed remotely
[Diagram: the federation server (Big SQL) holds local tables and nicknames; its cost-based optimizer builds local + remote execution plans, and wrappers with client libraries send query fragments to remote sources in their native dialect.]
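The steps above might look like the following in a Big SQL session. This is a sketch using DB2-style federation DDL; the server name, schema, and credentials are hypothetical, and the remote database is assumed to be cataloged locally:

```sql
-- Register the wrapper (client library) for DB2-family sources
CREATE WRAPPER DRDA;

-- Define the remote DB2 server and its credentials (names are hypothetical)
CREATE SERVER MYDB2 TYPE DB2/UDB VERSION 11 WRAPPER DRDA
  AUTHORIZATION "myID" PASSWORD "myPassword"
  OPTIONS (DBNAME 'SAMPLEDB');

-- Map the local user to the remote credentials
CREATE USER MAPPING FOR USER SERVER MYDB2
  OPTIONS (REMOTE_AUTHID 'myID', REMOTE_PASSWORD 'myPassword');

-- Create a nickname; it now looks like an ordinary table to applications
CREATE NICKNAME myschema.db2_media FOR MYDB2.MYSCHEMA.MEDIA;

-- Query it like any other table; fragments are pushed down where possible
SELECT id, name FROM myschema.db2_media;
```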
Joining data across sources
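A cross-source join can be sketched as a single Big SQL statement that mixes a Hadoop table with a nickname over a remote RDBMS table. The nickname db2_media and the join column below are hypothetical; big_sales_parquet is the HDFS table created in an earlier chart:

```sql
-- big_sales_parquet: Big SQL table stored in HDFS
-- db2_media: hypothetical nickname over a table in a remote DB2 database
SELECT m.name, SUM(s.quantity) AS total_qty
FROM big_sales_parquet s
JOIN db2_media m
  ON m.id = s.product_key        -- hypothetical join column
GROUP BY m.name
ORDER BY total_qty DESC
FETCH FIRST 10 ROWS ONLY;
```

The optimizer decides which fragments run locally on Hadoop and which are pushed down to the remote source.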
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
About Spark and Big SQL
§ Easy to query Big SQL (or DB2 LUW) tables through Spark SQL
− See link to self-study lab in “Resources” section
§ Follow typical Spark SQL JDBC data source pattern
− Identify JDBC driver and connection properties
− Load table contents into a DataFrame, Spark SQL temporary view
− Execute Spark SQL queries
− Applies to Big SQL tables in the Hive warehouse, HBase, or an arbitrary HDFS directory
− Query results can be manipulated via other Spark libraries
§ Technical preview: launch Spark jobs from Big SQL via UDF
Accessing Big SQL data from Spark shell

// based on BigInsights tech preview release that includes Spark 2.1
// launch the shell with --driver-class-path pointing to the JDBC driver .jar

// read data from Big SQL table "t1" and load into a DataFrame
val sampleDF = spark.read.format("jdbc")
  .option("url", "jdbc:db2://yourHost.com:32051/BIGSQL")
  .option("dbtable", "yourSchema.t1")
  .option("user", "yourID")
  .option("password", "yourPassword")
  .load()

// display full contents
sampleDF.show()

// create a Spark SQL temporary view to query
sampleDF.createOrReplaceTempView("v1")

// query the view and display the results
spark.sql("select col1, col3 from v1 where col2 > 100 limit 15").show()
Technical preview: launch Spark jobs from Big SQL
§ Spark jobs can be invoked from Big SQL using a table UDF abstraction
§ Example: Call the SYSHADOOP.EXECSPARK built-in UDF to kick off a Spark job that reads a JSON file stored on HDFS
SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
  language => 'scala',
  class    => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
  uri      => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
  card     => 100000)) AS doc
WHERE doc.country IS NOT NULL
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ Summary
What is TPC-DS?
§ TPC = Transaction Processing Performance Council
− Non-profit corporation (vendor independent)
− Defines various industry-driven database benchmarks; DS = Decision Support
− Models a multi-domain data warehouse environment for a hypothetical retailer (retail sales, web sales, inventory, demographics, promotions)
§ Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
§ 99 pre-defined queries spanning several classes: reporting, ad hoc, iterative OLAP, data mining
100TB TPC-DS is BIG data
Benchmark environment: IBM “F1” Spark SQL cluster
§ 28 nodes total (Lenovo x3650 M5), each configured as:
• 2 sockets (18 cores/socket)
• 1.5 TB RAM
• 8x 2TB SSD
§ 2 racks
− 20x 2U servers per rack (42U racks)
§ 1 switch: Mellanox SN2700, 100GbE, 32 ports
SPARK SQL 2.1 HADOOP-DS @ 100TB: AT A GLANCE
§ Performance / working queries: Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop @ 100TB scale
§ Compression: 60% space saved with Parquet
Query compliance through the scale factors
§ SQL compliance is important because Business Intelligence tools generate standard SQL
− Rewriting queries is painful and impacts productivity
§ Spark SQL 2.1 can run all 99 TPC-DS queries, but only at lower scale factors
§ Spark SQL failures @ 100TB:
− 12 runtime errors
− 4 timeouts (> 10 hours)
§ Big SQL has been successfully executing all 99 queries since Oct 2014
§ IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
Big SQL is 3.2x faster than Spark 2.1 (4 concurrent streams)
§ Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries
HADOOP-DS @ 100TB: AT A GLANCE
§ Performance: Big SQL 3.2x faster (working queries)
§ CPU (vs Spark): Big SQL uses 3.7x less CPU; average CPU usage 76.4%
§ I/O (vs Spark): Big SQL reads 12x less data and writes 30x less data; max I/O throughput: read 4.4 GB/sec, write 2.8 GB/sec
§ Compression: 60% space saved with Parquet
Recommendation: right tool for the right job

Big SQL: ideal tool for BI data analysts and production workloads
§ Migrating existing workloads to Hadoop
§ Security
§ Many concurrent users
§ Best performance

Spark SQL: ideal tool for data scientists and discovery
§ Machine learning
§ Simpler SQL
§ Good performance

Not mutually exclusive: Big SQL and Spark SQL can co-exist in the cluster
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
Summary
§ Big SQL = easy path for DB2 professionals to work with Big Data
§ Runs on popular Hadoop platforms from IBM, Hortonworks
§ Integrates with Spark
§ Compatible with DB2 and ISO SQL
§ Brings high-performance, enterprise-grade query engine to popular open source Big Data platforms
Want to learn more?
§ Hadoop Dev
https://developer.ibm.com/hadoop/
§ Labs: Big SQL intro, Spark / Big SQL, . . .
https://developer.ibm.com/hadoop/docs/getting-started/tutorials/big-sql-hadoop-tutorial/
§ 100TB benchmark
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ This presentation
https://www.slideshare.net/CynthiaSaracco/presentations
Supplemental
Big SQL architecture
§ Head (coordinator / management) node
− Listens for JDBC/ODBC connections
− Compiles and optimizes the query
− Optionally stores user data in a DB2-compatible table (single node only); useful for some reference data
§ Big SQL worker processes reside on compute nodes (some or all)
§ Worker nodes stream data between each other as needed
§ Workers can spill large data sets to local disk if needed
− Allows Big SQL to work with data sets larger than available memory
CPU Profile for Big SQL vs. Spark SQL (Hadoop-DS @ 100TB, 4 Concurrent Streams)
§ Spark SQL uses almost 3x more system CPU; these are wasted CPU cycles
§ Average CPU utilization: 76.4% (Big SQL) vs. 88.2% (Spark SQL)
I/O Profile for Big SQL vs. Spark SQL (Hadoop-DS @ 100TB, 4 Concurrent Streams)
§ Spark SQL required 3.6x more reads and 9.5x more writes
§ Big SQL can drive peak I/O nearly 2x higher