Using your DB2 SQL skills with Hadoop and Spark
TRANSCRIPT
© 2016 IBM Corporation
Using your DB2 skills with Hadoop and Spark
Presented to TRIDEX DB2 Users Group, June 2017
C. M. Saracco, IBM Silicon Valley Lab
https://www.slideshare.net/CynthiaSaracco/presentations
Executive summary
§ About Apache Hadoop and Spark
− Popular open source technologies for working with Big Data
• Clustered computing > scalability
• Varied data > no pre-set structure or schema requirements
− Hadoop: distributed file system (storage), MapReduce API, . . .
− Spark: in-memory data processing (speed), built-in libraries, . . .
§ About Big SQL
− DB2-compatible query engine for Hadoop data (IBM or Hortonworks distributions)
− Based on decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization; strong runtime performance for analytical workloads
§ Some ways to leverage DB2 SQL skills
− Create / manage / query “local” or distributed tables in Hadoop
− Query / join Hadoop data with DB2, Oracle, Teradata, etc. data via query federation
− Leverage Spark to query and manipulate Big SQL or DB2 data
− Leverage Big SQL to initiate Spark jobs and analyze results
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
Information is at the center of a new wave of opportunity… and organizations need deeper insights
§ 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
§ 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
§ 83% of CIOs cited “business intelligence and analytics” as part of their visionary plans to enhance competitiveness
§ 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
§ 4 million “likes” per minute
§ 300,000 tweets per minute
§ 150 million emails per minute
§ 2.78 million video views per minute
§ 2.5 TB per day per A350 plane
§ > 1 PB per day from gas turbines
(1 ZB = 1 billion TB)
Big Data adoption (study results)

Stage                                                                      2012 to 2014   2015   Change
Educate: learning about big data capabilities                              24%-26%        10%    250% decrease
Explore: exploring internal use cases and developing a strategy            43%-47%        53%    125% increase
Engage: implementing infrastructure and running pilot activities           22%-27%        25%    0% change
Execute: using big data and analytics pervasively across the enterprise    5%-6%          13%    210% increase

2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
Big Data ROI often < 18 months
§ Return on investment period for big data and analytics projects, as reported by respondents
2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
About Hadoop and Spark
§ Both open source Apache projects
− Exploit distributed computing environments
− Enable processing of large volumes of varied data
§ Hadoop
− Inspired by Google technologies (MapReduce, GFS)
− Originally designed for batch-oriented, read-intensive applications
− “Core” consists of distributed file system, MapReduce, job scheduler, utilities
− Complementary projects span data warehousing, workflow management, columnar data storage, activity monitoring, . . .
§ Spark
− Began as a UC Berkeley project
− Fast, general-purpose engine for working with Big Data in memory
− Popular built-in libraries for machine learning, streaming data, query (SQL), . . .
− No built-in storage; interfaces to Hadoop, other stores
IBM contributions: Hadoop and Spark
Snapshots taken Jan. 2017. Latest content available online via Apache dashboards.
IOP relates to Hadoop; STC relates to Spark.
What is Big SQL?
[Architecture at a glance: SQL-based applications connect through the IBM data server client to the Big SQL engine (SQL MPP run-time) running on IBM Open Platform or Hortonworks Data Platform, with data stored in HDFS.]
§ Comprehensive, standard SQL for Hadoop
– SELECT: joins, unions, aggregates, subqueries . . .
– UPDATE/DELETE (HBase-managed tables)
– GRANT/REVOKE, INSERT … INTO
– SQL procedural logic (SQL PL)
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
§ Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuously running daemons (no start-up latency)
– Message passing allows data to flow between nodes without persisting intermediate results
– In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
§ Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
§ Integration with RDBMSs via LOAD, query federation
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
About Hadoop and Big SQL
§ Big SQL
− Easy on-ramp to Hadoop for DB2 SQL professionals
− Create a query-ready data lake
− Offload “cold” RDBMS warehouse data to Hadoop
− . . . .
§ Some ways to use Big SQL . . .
− Create tables
− Load / insert data
− Execute complex queries
− Exploit various DB2 features: UDFs, EXPLAIN, workload management, Oracle / Netezza SQL compatibility, . . .
− Exploit various Hadoop features: Hive, HBase, SerDes, . . .
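As one illustration of the DB2 feature carry-over, a hedged sketch of capturing an access plan with the familiar EXPLAIN facility (the users table appears in a later chart; the database name and output file here are hypothetical):

```sql
-- Capture the access plan for a query against a Big SQL table
-- (assumes the EXPLAIN tables have already been created, as in DB2)
EXPLAIN PLAN FOR
  SELECT fname, lname
  FROM users
  WHERE id > 100;

-- The plan is written to the EXPLAIN tables; format it with the
-- db2exfmt utility, e.g.:  db2exfmt -d bigsql -1 -o plan.txt
```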
Invocation options
§ Command-line interface: Java SQL Shell (JSqsh)
§ Web tooling (Data Server Manager)
§ Tools that support IBM JDBC/ODBC driver
Creating a Big SQL table
§ Standard CREATE TABLE DDL with extensions

create hadoop table users
( id        int not null primary key,
  office_id int null,
  fname     varchar(30) not null,
  lname     varchar(30) not null )
row format delimited
fields terminated by '|'
stored as textfile;

Worth noting:
• The “hadoop” keyword creates the table in HDFS
• Delimited row format and textfile are the defaults
• Constraints are not enforced (but useful for query optimization)
• Examples in these charts focus on HDFS storage, within or external to the Hive warehouse; HBase examples provided separately
CREATE VIEW
§ Standard SQL syntax
create view my_users as
select fname, lname from biadmin.users where id > 100;
Populating tables via LOAD
§ Typically best runtime performance
§ Load data from local or remote file system

load hadoop using file url
'sftp://myID:[email protected]:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with source properties ('field.delimiter'='\t')
into table gosalesdw.GO_REGION_DIM overwrite;

§ Load data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC connection

load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
Populating tables via INSERT
§ INSERT INTO . . . SELECT FROM . . .
− Parallel read and write operations

CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key     INT NOT NULL,
  product_name    VARCHAR(150),
  quantity        INT,
  order_method_en VARCHAR(90) )
STORED AS parquetfile;

-- source tables do not need to be in Parquet format
INSERT INTO big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales,
     sls_product_dim prod,
     sls_product_lookup pnumb,
     sls_order_method_dim meth
WHERE pnumb.product_language = 'EN'
  AND sales.product_key = prod.product_key
  AND prod.product_number = pnumb.product_number
  AND meth.order_method_key = sales.order_method_key
  AND sales.quantity > 5500;

§ INSERT INTO . . . VALUES( . . . )
− Not parallelized; 1 file per INSERT. Not recommended except for quick tests

CREATE HADOOP TABLE foo (col1 int, col2 varchar(10));
INSERT INTO foo VALUES (1, 'hello');
CREATE . . . TABLE . . . AS SELECT . . .
§ Create a Big SQL table based on contents of other table(s)
§ Source tables can be in different file formats or use different underlying storage mechanisms

-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key       INT NOT NULL,
  product_line_code INT NOT NULL,
  product_type_key  INT NOT NULL,
  product_type_code INT NOT NULL,
  product_line_en   VARCHAR(90),
  product_line_de   VARCHAR(90) )
AS SELECT product_key, d.product_line_code, product_type_key,
          product_type_code, product_line_en, product_line_de
   FROM extern.sls_product_dim d, extern.sls_product_line_lookup l
   WHERE d.product_line_code = l.product_line_code;
SQL capability highlights
§ Query operations
− Projections, restrictions
− UNION, INTERSECT, EXCEPT
− Wide range of built-in functions (e.g. OLAP)
− Various Oracle, Netezza compatibility items
§ Full support for subqueries
− In SELECT, FROM, WHERE and HAVING clauses
− Correlated and uncorrelated
− Equality, non-equality subqueries
− EXISTS, NOT EXISTS, IN, ANY, SOME, etc.
§ All standard join operations
− Standard and ANSI join syntax
− Inner, outer, and full outer joins
− Equality, non-equality, cross join support
− Multi-value join
§ Stored procedures, user-defined functions, user-defined aggregates

Sample query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
Power of standard SQL
§ Big SQL executes all 22 TPC-H queries without modification
§ Big SQL executes all 99 TPC-DS queries without modification
§ Big SQL leverages DB2 query rewrite technology for efficient optimization

Original query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name

Re-written query:

SELECT s_name, count(1) AS numwait
FROM
 (SELECT s_name
  FROM
   (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey
    FROM
     (SELECT l_orderkey,
             count(distinct l_suppkey) as count_suppkey,
             max(l_suppkey) as max_suppkey
      FROM lineitem
      WHERE l_receiptdate > l_commitdate
      GROUP BY l_orderkey) t2
    RIGHT OUTER JOIN
     (SELECT s_name, l_orderkey, l_suppkey
      FROM
       (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey
        FROM
         (SELECT l_orderkey,
                 count(distinct l_suppkey) as count_suppkey,
                 max(l_suppkey) as max_suppkey
          FROM lineitem
          GROUP BY l_orderkey) t1
        JOIN
         (SELECT s_name, l_orderkey, l_suppkey
          FROM orders o
          JOIN
           (SELECT s_name, l_orderkey, l_suppkey
            FROM nation n
            JOIN supplier s
              ON s.s_nationkey = n.n_nationkey
             AND n.n_name = 'INDONESIA'
            JOIN lineitem l
              ON s.s_suppkey = l.l_suppkey
            WHERE l.l_receiptdate > l.l_commitdate) l1
          ON o.o_orderkey = l1.l_orderkey
         AND o.o_orderstatus = 'F') l2
        ON l2.l_orderkey = t1.l_orderkey) a
      WHERE (count_suppkey > 1)
         OR ((count_suppkey = 1) AND (l_suppkey <> max_suppkey))) l3
    ON l3.l_orderkey = t2.l_orderkey) b
  WHERE (count_suppkey IS NULL)
     OR ((count_suppkey = 1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
Query federation = virtualized data access
§ Transparent
− Appears to be one source
− Programmers don’t need to know how / where data is stored
§ Heterogeneous
− Accesses data from diverse sources
§ High function
− Full query support against all data
− Capabilities of sources as well
§ Autonomous
− Non-disruptive to data sources, existing applications, systems
§ High performance
− Optimization of distributed queries
[Diagram: SQL tools and applications access virtualized data spanning multiple data sources.]
Federation in practice
§ Admin enables federation
§ Apps connect to the Big SQL database
§ Nicknames look like tables to the app
§ Big SQL optimizer creates a global data access plan with cost analysis, query push-down
§ Query fragments executed remotely
[Diagram: the federation server (Big SQL) holds local tables and nicknames; its cost-based optimizer builds local + remote execution plans, and wrappers with client libraries send query fragments to remote sources in their native dialect.]
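The steps above might look like the following in a Big SQL session. This is a sketch using DB2-style federation DDL; the server name, schema, and credentials are hypothetical, and the remote database is assumed to be cataloged locally:

```sql
-- Register the wrapper (client library) for DB2-family sources
CREATE WRAPPER DRDA;

-- Define the remote DB2 server and its credentials (names are hypothetical)
CREATE SERVER MYDB2 TYPE DB2/UDB VERSION 11 WRAPPER DRDA
  AUTHORIZATION "myID" PASSWORD "myPassword"
  OPTIONS (DBNAME 'SAMPLEDB');

-- Map the local user to the remote credentials
CREATE USER MAPPING FOR USER SERVER MYDB2
  OPTIONS (REMOTE_AUTHID 'myID', REMOTE_PASSWORD 'myPassword');

-- Create a nickname; it now looks like an ordinary table to applications
CREATE NICKNAME myschema.db2_media FOR MYDB2.MYSCHEMA.MEDIA;

-- Query it like any other table; fragments are pushed down where possible
SELECT id, name FROM myschema.db2_media;
```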
Joining data across sources
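A cross-source join can be sketched as a single Big SQL statement that mixes a Hadoop table with a nickname over a remote RDBMS table. The nickname db2_media and the join column below are hypothetical; big_sales_parquet is the HDFS table created in an earlier chart:

```sql
-- big_sales_parquet: Big SQL table stored in HDFS
-- db2_media: hypothetical nickname over a table in a remote DB2 database
SELECT m.name, SUM(s.quantity) AS total_qty
FROM big_sales_parquet s
JOIN db2_media m
  ON m.id = s.product_key        -- hypothetical join column
GROUP BY m.name
ORDER BY total_qty DESC
FETCH FIRST 10 ROWS ONLY;
```

The optimizer decides which fragments run locally on Hadoop and which are pushed down to the remote source.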
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
About Spark and Big SQL
§ Easy to query Big SQL (or DB2 LUW) tables through Spark SQL
− See link to self-study lab in “Resources” section
§ Follow typical Spark SQL JDBC data source pattern
− Identify JDBC driver and connection properties
− Load table contents into a DataFrame, Spark SQL temporary view
− Execute Spark SQL queries
− Applies to Big SQL tables in the Hive warehouse, HBase, or an arbitrary HDFS directory
− Query results can be manipulated via other Spark libraries
§ Technical preview: launch Spark jobs from Big SQL via UDF
Accessing Big SQL data from Spark shell

// based on BigInsights tech preview release that includes Spark 2.1
// launch the shell with --driver-class-path pointing to the JDBC driver .jar

// read data from Big SQL table "t1" and load into a DataFrame
val sampleDF = spark.read.format("jdbc")
  .option("url", "jdbc:db2://yourHost.com:32051/BIGSQL")
  .option("dbtable", "yourSchema.t1")
  .option("user", "yourID")
  .option("password", "yourPassword")
  .load()

// display full contents
sampleDF.show()

// create a Spark SQL temporary view to query
sampleDF.createOrReplaceTempView("v1")

// query the view and display the results
spark.sql("select col1, col3 from v1 where col2 > 100 limit 15").show()
Technical preview: launch Spark jobs from Big SQL
§ Spark jobs can be invoked from Big SQL using a table UDF abstraction
§ Example: Call the SYSHADOOP.EXECSPARK built-in UDF to kick off a Spark job that reads a JSON file stored on HDFS
SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
  language => 'scala',
  class    => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
  uri      => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
  card     => 100000)) AS doc
WHERE doc.country IS NOT NULL
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ Summary
What is TPC-DS?
§ TPC = Transaction Processing Performance Council
− Non-profit corporation (vendor independent)
− Defines various industry-driven database benchmarks; DS = Decision Support
− Models a multi-domain data warehouse environment for a hypothetical retailer (retail sales, web sales, inventory, demographics, promotions)
§ Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
§ 99 pre-defined queries spanning several classes: reporting, ad hoc, iterative OLAP, data mining
100TB TPC-DS is BIG data
Benchmark environment: IBM “F1” Spark SQL cluster
§ 28 nodes total (Lenovo x3650 M5), each configured as:
• 2 sockets (18 cores/socket)
• 1.5 TB RAM
• 8x 2TB SSD
§ 2 racks
− 20x 2U servers per rack (42U racks)
§ 1 switch: Mellanox SN2700, 100GbE, 32 ports
SPARK SQL 2.1 HADOOP-DS @ 100TB: AT A GLANCE
§ Performance / working queries: Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop @ 100TB scale
§ Compression: 60% space saved with Parquet
Query compliance through the scale factors
§ SQL compliance is important because Business Intelligence tools generate standard SQL
− Rewriting queries is painful and impacts productivity
§ Spark SQL 2.1 can run all 99 TPC-DS queries, but only at lower scale factors
§ Spark SQL failures @ 100TB:
− 12 runtime errors
− 4 timeouts (> 10 hours)
§ Big SQL has been successfully executing all 99 queries since Oct 2014
§ IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
Big SQL is 3.2x faster than Spark 2.1 (4 concurrent streams)
§ Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries
HADOOP-DS @ 100TB: AT A GLANCE
§ Performance: Big SQL 3.2x faster (working queries)
§ CPU (vs Spark): Big SQL uses 3.7x less CPU; average CPU usage 76.4%
§ I/O (vs Spark): Big SQL reads 12x less data and writes 30x less data; max I/O throughput: read 4.4 GB/sec, write 2.8 GB/sec
§ Compression: 60% space saved with Parquet
Recommendation: right tool for the right job

Big SQL: ideal tool for BI data analysts and production workloads
§ Migrating existing workloads to Hadoop
§ Security
§ Many concurrent users
§ Best performance

Spark SQL: ideal tool for data scientists and discovery
§ Machine learning
§ Simpler SQL
§ Good performance

Not mutually exclusive: Big SQL and Spark SQL can co-exist in the cluster
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
Summary
§ Big SQL = easy path for DB2 professionals to work with Big Data
§ Runs on popular Hadoop platforms from IBM, Hortonworks
§ Integrates with Spark
§ Compatible with DB2 and ISO SQL
§ Brings high-performance, enterprise-grade query engine to popular open source Big Data platforms
Want to learn more?
§ Hadoop Dev
https://developer.ibm.com/hadoop/
§ Labs: Big SQL intro, Spark / Big SQL, . . .
https://developer.ibm.com/hadoop/docs/getting-started/tutorials/big-sql-hadoop-tutorial/
§ 100TB benchmark
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ This presentation
https://www.slideshare.net/CynthiaSaracco/presentations
Supplemental
Big SQL architecture
§ Head (coordinator / management) node
− Listens for JDBC/ODBC connections
− Compiles and optimizes the query
− Optionally stores user data in a DB2-compatible table (single node only); useful for some reference data
§ Big SQL worker processes reside on compute nodes (some or all)
§ Worker nodes stream data between each other as needed
§ Workers can spill large data sets to local disk if needed
− Allows Big SQL to work with data sets larger than available memory
CPU Profile for Big SQL vs. Spark SQL (Hadoop-DS @ 100TB, 4 Concurrent Streams)
§ Spark SQL uses almost 3x more system CPU; these are wasted CPU cycles
§ Average CPU utilization: 76.4% (Big SQL) vs. 88.2% (Spark SQL)
I/O Profile for Big SQL vs. Spark SQL (Hadoop-DS @ 100TB, 4 Concurrent Streams)
§ Spark SQL required 3.6x more reads and 9.5x more writes
§ Big SQL can drive peak I/O nearly 2x higher