Using your DB2 SQL Skills with Hadoop and Spark


Page 1: Using your DB2 SQL Skills with Hadoop and Spark

© 2016 IBM Corporation

Using your DB2 skills with Hadoop and Spark Presented to TRIDEX DB2 Users Group, June 2017

C. M. Saracco, IBM Silicon Valley Lab
https://www.slideshare.net/CynthiaSaracco/presentations

Page 2: Using your DB2 SQL Skills with Hadoop and Spark


Executive summary

§ About Apache Hadoop and Spark
− Popular open source technologies for working with Big Data
• Clustered computing > scalability
• Varied data > no pre-set structure or schema requirements
− Hadoop: distributed file system (storage), MapReduce API, . . .
− Spark: in-memory data processing (speed), built-in libraries, . . .

§ About Big SQL
− DB2-compatible query engine for Hadoop data (IBM or Hortonworks distributions)
− Based on decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization; strong runtime performance for analytical workloads

§ Some ways to leverage DB2 SQL skills
− Create / manage / query "local" or distributed tables in Hadoop
− Query / join Hadoop data with DB2, Oracle, Teradata, etc. data via query federation
− Leverage Spark to query and manipulate Big SQL or DB2 data
− Leverage Big SQL to initiate Spark jobs and analyze results

Page 3: Using your DB2 SQL Skills with Hadoop and Spark

Agenda

§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities

§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation

§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL

§ Performance: 100TB benchmark summary

§ Summary

Page 4: Using your DB2 SQL Skills with Hadoop and Spark

Agenda

§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities

§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation

§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL

§ Performance: 100TB benchmark summary

§ Summary

Page 5: Using your DB2 SQL Skills with Hadoop and Spark

Information is at the center of a new wave of opportunity, and organizations need deeper insights:

• 1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
• 1 in 2 business leaders say they don't have access to the information they need to do their jobs
• 83% of CIOs cited "business intelligence and analytics" as part of their visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions

And the data keeps growing:

• 4 million "likes" per minute
• 300,000 tweets per minute
• 150 million emails per minute
• 2.78 million video views per minute
• 2.5 TB per day per A350 plane
• > 1 PB per day from gas turbines
• 1 ZB = 1 billion TB

Page 6: Using your DB2 SQL Skills with Hadoop and Spark

Big Data adoption (study results)

• Educate (learning about big data capabilities): 24%-26% of respondents in 2012-2014, down to 10% in 2015
• Explore (exploring internal use cases and developing a strategy): 43%-47% in 2012-2014, up to 53% in 2015
• Engage (implementing infrastructure and running pilot activities): 22%-27% in 2012-2014, 25% in 2015 (essentially unchanged)
• Execute (using big data and analytics pervasively across the enterprise): 5%-6% in 2012-2014, up to 13% in 2015

2015 IBV study "Analytics: The Upside of Disruption" (ibm.biz/w3_2015analytics)

Page 7: Using your DB2 SQL Skills with Hadoop and Spark

Big Data ROI often < 18 months

Return on investment period for big data and analytics projects, as reported by respondents.

2015 IBV study "Analytics: The Upside of Disruption" (ibm.biz/w3_2015analytics)

Page 8: Using your DB2 SQL Skills with Hadoop and Spark

About Hadoop and Spark

§ Both open source Apache projects
− Exploit distributed computing environments
− Enable processing of large volumes of varied data

§ Hadoop
− Inspired by Google technologies (MapReduce, GFS)
− Originally designed for batch-oriented, read-intensive applications
− "Core" consists of distributed file system, MapReduce, job scheduler, utilities
− Complementary projects span data warehousing, workflow management, columnar data storage, activity monitoring, . . .

§ Spark
− Began as a UC Berkeley project
− Fast, general-purpose engine for working with Big Data in memory
− Popular built-in libraries for machine learning, streaming data, query (SQL), . . .
− No built-in storage; interfaces to Hadoop and other stores

Page 9: Using your DB2 SQL Skills with Hadoop and Spark

IBM contributions: Hadoop and Spark

Snapshots taken Jan. 2017. Latest content available online via Apache dashboards. IOP relates to Hadoop; STC relates to Spark.

Page 10: Using your DB2 SQL Skills with Hadoop and Spark

What is Big SQL?

[Architecture diagram: SQL-based applications connect through the IBM data server client to the Big SQL engine (SQL MPP run-time), which operates on data stored in HDFS within IBM Open Platform or Hortonworks Data Platform]

§ Comprehensive, standard SQL for Hadoop
– SELECT: joins, unions, aggregates, subqueries . . .
– UPDATE/DELETE (HBase-managed tables)
– GRANT/REVOKE, INSERT . . . INTO
– SQL procedural logic (SQL PL)
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers

§ Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuously running daemons (no start-up latency)
– Message passing allows data to flow between nodes without persisting intermediate results
– In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules

§ Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required

§ Integration with RDBMSs via LOAD, query federation

Page 11: Using your DB2 SQL Skills with Hadoop and Spark

Agenda

§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities

§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation

§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL

§ Performance: 100TB benchmark summary

§ Summary

Page 12: Using your DB2 SQL Skills with Hadoop and Spark

About Hadoop and Big SQL

§ Big SQL
− Easy on-ramp to Hadoop for DB2 SQL professionals
− Create a query-ready data lake
− Offload "cold" RDBMS warehouse data to Hadoop
− . . .

§ Some ways to use Big SQL . . .
− Create tables
− Load / insert data
− Execute complex queries
− Exploit various DB2 features: UDFs, EXPLAIN, workload management, Oracle / Netezza SQL compatibility, . . .
− Exploit various Hadoop features: Hive, HBase, SerDes, . . .

Page 13: Using your DB2 SQL Skills with Hadoop and Spark


Invocation options

§ Command-line interface: Java SQL Shell (JSqsh)

§ Web tooling (Data Server Manager)

§ Tools that support IBM JDBC/ODBC driver

Page 14: Using your DB2 SQL Skills with Hadoop and Spark


Creating a Big SQL table

§ Standard CREATE TABLE DDL with extensions

create hadoop table users
(
  id int not null primary key,
  office_id int null,
  fname varchar(30) not null,
  lname varchar(30) not null
)
row format delimited
fields terminated by '|'
stored as textfile;

Worth noting:
• The HADOOP keyword creates the table in HDFS
• Delimited row format and textfile storage are the defaults
• Constraints are not enforced (but are useful for query optimization)
• Examples in these charts focus on HDFS storage, both within and external to the Hive warehouse (see the external-table sketch below); HBase examples are provided separately
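Where data already resides in a DFS directory outside the Hive warehouse, an EXTERNAL table can be layered over it in place. A minimal sketch, assuming a hypothetical comma-delimited log directory (the table name, columns, and path are illustrative, not from the original deck):

-- EXTERNAL keyword: catalog the table over files that already exist
create external hadoop table weblogs
(
  ip varchar(15),
  url varchar(500),
  hits int
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/bigsql/weblogs';

Dropping an external table removes only the catalog entry; the files under the LOCATION path remain untouched.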

Page 15: Using your DB2 SQL Skills with Hadoop and Spark


CREATE VIEW

§ Standard SQL syntax

create view my_users as

select fname, lname from biadmin.users where id > 100;

Page 16: Using your DB2 SQL Skills with Hadoop and Spark

Populating tables via LOAD

§ Typically best runtime performance
§ Load data from a local or remote file system

load hadoop using file url
'sftp://myID:[email protected]:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;

§ Loads data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via a JDBC connection

load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);

Page 17: Using your DB2 SQL Skills with Hadoop and Spark

Populating tables via INSERT

§ INSERT INTO . . . SELECT FROM . . .
− Parallel read and write operations

CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key INT NOT NULL,
  product_name VARCHAR(150),
  quantity INT,
  order_method_en VARCHAR(90) )
STORED AS parquetfile;

-- source tables do not need to be in Parquet format
insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales,
     sls_product_dim prod,
     sls_product_lookup pnumb,
     sls_order_method_dim meth
WHERE pnumb.product_language = 'EN'
  AND sales.product_key = prod.product_key
  AND prod.product_number = pnumb.product_number
  AND meth.order_method_key = sales.order_method_key
  AND sales.quantity > 5500;

§ INSERT INTO . . . VALUES( . . . )
− Not parallelized; 1 file per INSERT. Not recommended except for quick tests

CREATE HADOOP TABLE foo (col1 int, col2 varchar(10));

INSERT INTO foo VALUES (1, 'hello');

Page 18: Using your DB2 SQL Skills with Hadoop and Spark

CREATE . . . TABLE . . . AS SELECT . . .

§ Create a Big SQL table based on the contents of other table(s)
§ Source tables can be in different file formats or use different underlying storage mechanisms

-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL,
  product_line_code INT NOT NULL,
  product_type_key INT NOT NULL,
  product_type_code INT NOT NULL,
  product_line_en VARCHAR(90),
  product_line_de VARCHAR(90) )
as select product_key, d.product_line_code, product_type_key,
          product_type_code, product_line_en, product_line_de
   from extern.sls_product_dim d, extern.sls_product_line_lookup l
   where d.product_line_code = l.product_line_code;

Page 19: Using your DB2 SQL Skills with Hadoop and Spark

SQL capability highlights

§ Query operations
− Projections, restrictions
− UNION, INTERSECT, EXCEPT
− Wide range of built-in functions (e.g., OLAP)
− Various Oracle, Netezza compatibility items

§ Full support for subqueries
− In SELECT, FROM, WHERE, and HAVING clauses
− Correlated and uncorrelated
− Equality, non-equality subqueries
− EXISTS, NOT EXISTS, IN, ANY, SOME, etc.

§ All standard join operations
− Standard and ANSI join syntax
− Inner, outer, and full outer joins
− Equality, non-equality, cross join support
− Multi-value join

§ Stored procedures, user-defined functions, user-defined aggregates (see the UDF sketch after the example query below)

Example (TPC-H Q21):

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey
  )
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate
  )
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
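Since the capability list above mentions SQL PL routines, here is a minimal hedged sketch of a scalar user-defined function; the function name is hypothetical, and the users table comes from the earlier CREATE TABLE example:

-- scalar UDF using standard DB2 SQL PL syntax
create function full_name(fname varchar(30), lname varchar(30))
  returns varchar(61)
  language sql
  deterministic
  return fname || ' ' || lname;

-- use it like any built-in function
select full_name(fname, lname) from biadmin.users where id > 100;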

Page 20: Using your DB2 SQL Skills with Hadoop and Spark

Power of standard SQL

§ Big SQL executes all 22 TPC-H queries without modification
§ Big SQL executes all 99 TPC-DS queries without modification
§ Big SQL leverages DB2 query rewrite technology for efficient optimization

Original query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name

Re-written query:

SELECT s_name, count(1) AS numwait
FROM
  (SELECT s_name FROM
    (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey
     FROM
       (SELECT l_orderkey,
               count(distinct l_suppkey) as count_suppkey,
               max(l_suppkey) as max_suppkey
        FROM lineitem
        WHERE l_receiptdate > l_commitdate
        GROUP BY l_orderkey) t2
     RIGHT OUTER JOIN
       (SELECT s_name, l_orderkey, l_suppkey
        FROM
          (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey
           FROM
             (SELECT l_orderkey,
                     count(distinct l_suppkey) as count_suppkey,
                     max(l_suppkey) as max_suppkey
              FROM lineitem
              GROUP BY l_orderkey) t1
           JOIN
             (SELECT s_name, l_orderkey, l_suppkey
              FROM orders o
              JOIN
                (SELECT s_name, l_orderkey, l_suppkey
                 FROM nation n
                 JOIN supplier s
                   ON s.s_nationkey = n.n_nationkey
                   AND n.n_name = 'INDONESIA'
                 JOIN lineitem l
                   ON s.s_suppkey = l.l_suppkey
                 WHERE l.l_receiptdate > l.l_commitdate) l1
              ON o.o_orderkey = l1.l_orderkey
              AND o.o_orderstatus = 'F') l2
           ON l2.l_orderkey = t1.l_orderkey) a
        WHERE (count_suppkey > 1)
           OR ((count_suppkey = 1) AND (l_suppkey <> max_suppkey))) l3
     ON l3.l_orderkey = t2.l_orderkey) b
   WHERE (count_suppkey is null)
      OR ((count_suppkey = 1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name

Page 21: Using your DB2 SQL Skills with Hadoop and Spark

Query federation = virtualized data access

Transparent
§ Appears to be one source
§ Programmers don't need to know how / where data is stored

Heterogeneous
§ Accesses data from diverse sources

High function
§ Full query support against all data
§ Capabilities of sources as well

Autonomous
§ Non-disruptive to data sources, existing applications, systems

High performance
§ Optimization of distributed queries

[Diagram: SQL tools and applications access virtualized data spanning multiple data sources]

Page 22: Using your DB2 SQL Skills with Hadoop and Spark

Federation in practice

§ Admin enables federation
§ Apps connect to the Big SQL database
§ Nicknames look like tables to the app
§ Big SQL optimizer creates a global data access plan with cost analysis and query push-down
§ Query fragments are executed remotely (see the setup sketch below)

[Diagram: apps connect to the Big SQL federation server; its cost-based optimizer produces local + remote execution plans, and nicknames map to remote tables through wrappers / client libraries that speak each remote source's native dialect]
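A hedged sketch of the one-time setup an admin might run, following DB2-style federation DDL; the server, schema, and nickname names here are hypothetical, and exact option names vary by data source type and release:

-- register the remote DB2 source and how to reach it
create server db2_sales type db2/udb version 11.1 wrapper drda
  options (host 'some.host.com', port '50000', dbname 'SALESDB');

-- map the local user to remote credentials
create user mapping for user server db2_sales
  options (remote_authid 'myID', remote_password 'myPassword');

-- expose a remote table under a local nickname
create nickname sales_nn for db2_sales.salesadmin.sales_fact;

After this, applications simply query sales_nn as if it were a local table.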

Page 23: Using your DB2 SQL Skills with Hadoop and Spark


Joining data across sources
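As a hedged illustration of such a join, reusing the users table from the earlier CREATE TABLE example and the hypothetical sales_nn nickname from the previous page (the user_id column is illustrative, not from the deck):

-- one statement joins local Hadoop data with remote DB2 data
select u.fname, u.lname, s.quantity
from biadmin.users u, sales_nn s
where u.id = s.user_id
  and s.quantity > 100;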

Page 24: Using your DB2 SQL Skills with Hadoop and Spark

Agenda

§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities

§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation

§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL

§ Performance: 100TB benchmark summary

§ Summary

Page 25: Using your DB2 SQL Skills with Hadoop and Spark

About Spark and Big SQL

§ Easy to query Big SQL (or DB2 LUW) tables through Spark SQL
− See link to self-study lab in "Resources" section

§ Follow the typical Spark SQL JDBC data source pattern
− Identify JDBC driver and connection properties
− Load table contents into a DataFrame and a Spark SQL temporary view
− Execute Spark SQL queries
− Applies to Big SQL tables in the Hive warehouse, HBase, or an arbitrary HDFS directory
− Query results can be manipulated via other Spark libraries

§ Technical preview: launch Spark jobs from Big SQL via UDF

Page 26: Using your DB2 SQL Skills with Hadoop and Spark

Accessing Big SQL data from the Spark shell

// based on the BigInsights tech preview release that includes Spark 2.1
// launch the shell with --driver-class-path pointing to the JDBC driver .jar

// read data from Big SQL table "t1" and load it into a DataFrame
val sampleDF = spark.read.format("jdbc")
  .option("url", "jdbc:db2://yourHost.com:32051/BIGSQL")
  .option("dbtable", "yourSchema.t1")
  .option("user", "yourID")
  .option("password", "yourPassword")
  .load()

// display full contents
sampleDF.show()

// create a Spark SQL temporary view to query
sampleDF.createOrReplaceTempView("v1")

// query the view and display the results
sql("select col1, col3 from v1 where col2 > 100 limit 15").show()

Page 27: Using your DB2 SQL Skills with Hadoop and Spark

Technical preview: launch Spark jobs from Big SQL

§ Spark jobs can be invoked from Big SQL using a table UDF abstraction

§ Example: call the SYSHADOOP.EXECSPARK built-in UDF to kick off a Spark job that reads a JSON file stored on HDFS

SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
    language => 'scala',
    class => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
    uri => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
    card => 100000)) AS doc
WHERE doc.country IS NOT NULL

Page 28: Using your DB2 SQL Skills with Hadoop and Spark

Agenda

§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities

§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation

§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL

§ Performance: 100TB benchmark summary
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/

§ Summary

Page 29: Using your DB2 SQL Skills with Hadoop and Spark

What is TPC-DS?

§ TPC = Transaction Processing Performance Council
− Non-profit corporation (vendor independent)
− Defines various industry-driven database benchmarks; DS = Decision Support
− Models a multi-domain data warehouse environment for a hypothetical retailer

Subject areas: retail sales, web sales, inventory, demographics, promotions

Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB, and 100TB

99 pre-defined queries, spanning several classes: reporting, ad hoc, iterative OLAP, and data mining

Page 30: Using your DB2 SQL Skills with Hadoop and Spark

100TB TPC-DS is BIG data

Page 31: Using your DB2 SQL Skills with Hadoop and Spark

Benchmark environment: IBM "F1" Spark SQL cluster

§ 28 nodes total (Lenovo x3650 M5)

§ Each configured as:
• 2 sockets (18 cores/socket)
• 1.5 TB RAM
• 8x 2TB SSD

§ 2 racks
− 20x 2U servers per rack (42U racks)

§ 1 switch: 100GbE, 32 ports, Mellanox SN2700

Page 32: Using your DB2 SQL Skills with Hadoop and Spark

Spark SQL 2.1 Hadoop-DS @ 100TB: at a glance

[Infographic: performance and working-query counts shown graphically]

• Compression: 60% space saved with Parquet
• Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop @ 100TB scale

Page 33: Using your DB2 SQL Skills with Hadoop and Spark

Query compliance through the scale factors

§ SQL compliance is important because Business Intelligence tools generate standard SQL
− Rewriting queries is painful and impacts productivity

Spark SQL
§ Spark SQL 2.1 can run all 99 TPC-DS queries, but only at lower scale factors
§ Spark SQL failures @ 100TB:
− 12 runtime errors
− 4 timeouts (> 10 hours)

Big SQL
§ Big SQL has been successfully executing all 99 queries since Oct 2014
§ IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB

Page 34: Using your DB2 SQL Skills with Hadoop and Spark

Big SQL is 3.2x faster than Spark 2.1 (4 concurrent streams)

Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries

Page 35: Using your DB2 SQL Skills with Hadoop and Spark

Hadoop-DS @ 100TB: at a glance

• Performance: Big SQL 3.2x faster
• Working queries: [shown in chart]
• CPU (vs Spark): Big SQL uses 3.7x less CPU
• I/O (vs Spark): Big SQL reads 12x less data; Big SQL writes 30x less data
• Compression: 60% space saved with Parquet
• Average CPU usage: 76.4%
• Max I/O throughput: read 4.4 GB/sec, write 2.8 GB/sec

Page 36: Using your DB2 SQL Skills with Hadoop and Spark

Recommendation: the right tool for the right job

Spark SQL: ideal tool for Data Scientists and discovery
• Machine learning
• Simpler SQL
• Good performance

Big SQL: ideal tool for BI Data Analysts and production workloads
• Migrating existing workloads to Hadoop
• Security
• Many concurrent users
• Best performance

Not mutually exclusive: Big SQL and Spark SQL can co-exist in the cluster

Page 37: Using your DB2 SQL Skills with Hadoop and Spark

Agenda

§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities

§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation

§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL

§ Performance: 100TB benchmark summary

§ Summary

Page 38: Using your DB2 SQL Skills with Hadoop and Spark


Summary

§ Big SQL = easy path for DB2 professionals to work with Big Data

§ Runs on popular Hadoop platforms from IBM, Hortonworks

§ Integrates with Spark

§ Compatible with DB2 and ISO SQL

§ Brings high-performance, enterprise-grade query engine to popular open source Big Data platforms

Page 39: Using your DB2 SQL Skills with Hadoop and Spark

Want to learn more?

§ Hadoop Dev
https://developer.ibm.com/hadoop/

§ Labs: Big SQL intro, Spark / Big SQL, . . .
https://developer.ibm.com/hadoop/docs/getting-started/tutorials/big-sql-hadoop-tutorial/

§ 100TB benchmark
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/

§ This presentation
https://www.slideshare.net/CynthiaSaracco/presentations

Page 40: Using your DB2 SQL Skills with Hadoop and Spark


Supplemental

Page 41: Using your DB2 SQL Skills with Hadoop and Spark

Big SQL architecture

§ Head (coordinator / management) node
− Listens for JDBC/ODBC connections
− Compiles and optimizes the query
− Optionally stores user data in a DB2-compatible table (single node only); useful for some reference data (see the sketch below)

§ Big SQL worker processes reside on compute nodes (some or all)
§ Worker nodes stream data between each other as needed
§ Workers can spill large data sets to local disk if needed
− Allows Big SQL to work with data sets larger than available memory
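As a minimal hedged sketch of that head-node option: omitting the HADOOP keyword creates a traditional DB2-style table that lives only on the head node (the table name and columns are illustrative):

-- no HADOOP keyword, so this is a single-node, DB2-compatible table
create table country_codes
(
  code char(2) not null primary key,
  name varchar(60)
);

insert into country_codes values ('US', 'United States');

Such a reference table can then be joined with distributed Hadoop tables in ordinary queries.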

Page 42: Using your DB2 SQL Skills with Hadoop and Spark

CPU profile for Big SQL vs. Spark SQL, Hadoop-DS @ 100TB, 4 concurrent streams

Spark SQL uses almost 3x more system CPU. These are wasted CPU cycles.

Average CPU utilization: 76.4% (Big SQL) vs. 88.2% (Spark SQL)

Page 43: Using your DB2 SQL Skills with Hadoop and Spark

I/O profile for Big SQL vs. Spark SQL, Hadoop-DS @ 100TB, 4 concurrent streams

Spark SQL required 3.6x more reads and 9.5x more writes.

Big SQL can drive peak I/O nearly 2x higher.