SQL on Hadoop 100TB TPC-DS Benchmark
© 2017 IBM Corporation
A Performance Study:
SQL-on-Hadoop with TPC-DS queries (Hadoop-DS)
Analytics Performance
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2017. All rights reserved.
— U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at
▪“Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
▪TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council
▪Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.
▪Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
▪Other company, product, or service names may be trademarks or service marks of others.
What is TPC-DS?
▪ TPC = Transaction Processing Performance Council
Non-profit corporation (vendor independent)
Defines various industry-driven database benchmarks
▪ DS = Decision Support
Models a multi-domain data warehouse environment for a hypothetical retailer
Domains: Retail Sales, Web Sales, Inventory, Demographics, Promotions
Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
99 Pre-Defined Queries
Query Classes: Reporting, Ad Hoc, Iterative OLAP, Data Mining
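To give a feel for the "Reporting" query class, here is a minimal sketch of a TPC-DS-style star-schema query. It is not one of the 99 official queries. It runs against an in-memory SQLite database rather than a Hadoop engine, the three tables are cut down to a few columns, and the sample rows are invented purely for illustration.

```python
import sqlite3

# Assumption: a tiny in-memory stand-in for three TPC-DS tables.
# Real TPC-DS tables have many more columns; the rows below are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (d_date_sk INTEGER PRIMARY KEY, d_year INTEGER);
CREATE TABLE item (i_item_sk INTEGER PRIMARY KEY, i_category TEXT);
CREATE TABLE store_sales (
    ss_sold_date_sk INTEGER,
    ss_item_sk INTEGER,
    ss_ext_sales_price REAL
);
INSERT INTO date_dim VALUES (1, 2001), (2, 2002);
INSERT INTO item VALUES (10, 'Books'), (11, 'Music');
INSERT INTO store_sales VALUES
    (1, 10, 20.0), (1, 11, 5.0), (2, 10, 7.5), (1, 10, 30.0);
""")

# Reporting-style query: join the fact table to two dimensions,
# then aggregate sales by year and item category.
REPORT_QUERY = """
SELECT d.d_year, i.i_category, SUM(ss.ss_ext_sales_price) AS total_sales
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
JOIN item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY d.d_year, i.i_category
ORDER BY d.d_year, i.i_category
"""

rows = conn.execute(REPORT_QUERY).fetchall()
for row in rows:
    print(row)
```

The official queries are considerably more complex (multi-way joins, subqueries, window functions), which is part of why running all 99 unmodified is a meaningful compliance test.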
What’s the Best Solution for
ANALYTICAL SQL ON HADOOP?
Thinking of moving BI workloads from Data Warehouse to Hadoop?
Radiant Advisors: Sponsored by Teradata (Q2 2016)
Presto, Impala, Hive and Spark SQL (pre-2.0)
Publisher | Date | Product | TPC-DS Queries | Data Vol
Cloudera | Sept 2016 | Impala 2.6 on AWS. Claims 42% more performant than AWS Redshift | 70 query subset | 3TB
Cloudera | August 2016 | Impala 2.6. Claims 22% faster for TPC-DS than previous version | 17 queries referenced | Not specified
Cloudera | April 2016 | Impala 2.5. Claims 4.3x faster for TPC-DS than previous version | 24 query subset | 15TB *1
Hortonworks | July 2016 | Hive 2.1 with LLAP. Claims 25x faster for TPC-DS than Hive 1.2 | 15 query subset | 1TB
Latest Benchmarks Direct from Cloudera / Hortonworks SQL
are not much better.
But there is good news…
SPARK RUNS ALL 99 QUERIES
IBM Leadership in Spark SQL and ML
Major focus areas include
Spark SQL and ML
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12326761
Statistics as of February 1, 2017
IBM Shared Experiences running 99 TPC-DS queries (Oct 2016) @ Spark Summit Brussels
10 TB Scale Factor
Spark 2.1 shows continued improvement…
WHAT WOULD IT TAKE TO RUN 100 TB?
IBM delivers the most complete benchmark by any vendor for SQL on Hadoop with 10X more data
100TB TPC-DS is BIG data
Benchmark Environment: IBM “F1” Spark SQL Cluster
▪ 28 Nodes Total (Lenovo x3640 M5)
▪ Each configured as:
• 2 sockets (18 cores/socket)
• 1.5 TB RAM
• 8x 2TB SSD
▪ 2 Racks
20x 2U servers per rack (42U racks)
▪ 1 Switch, 100GbE, 32 ports Mellanox SN2700
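As a rough sanity check on the scale of this cluster, the per-node figures above can be rolled up into cluster totals. This is a back-of-the-envelope sketch: it assumes all 28 nodes are configured identically, counts physical cores only, and treats SSD capacity as raw space before any HDFS replication or formatting overhead.

```python
# Per-node figures taken from the slide; variable names are illustrative.
NODES = 28
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 18
RAM_TB_PER_NODE = 1.5
SSDS_PER_NODE = 8
SSD_TB_EACH = 2

total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
total_ram_tb = NODES * RAM_TB_PER_NODE
total_ssd_tb = NODES * SSDS_PER_NODE * SSD_TB_EACH

# Assumption: compare aggregate RAM against the nominal 100 TB scale factor.
DATASET_TB = 100
ram_to_data_ratio = total_ram_tb / DATASET_TB

print(f"Physical cores: {total_cores}")
print(f"Total RAM (TB): {total_ram_tb}")
print(f"Raw SSD (TB):   {total_ssd_tb}")
print(f"RAM / dataset:  {ram_to_data_ratio:.2f}")
```

Under these assumptions the aggregate memory is well below the nominal dataset size, so the workload cannot be served entirely from RAM and storage I/O matters.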
SPARK SQL 2.1 HADOOP-DS @ 100TB: AT A GLANCE
PERFORMANCE
WORKING QUERIES
COMPRESSION: 60% SPACE SAVED WITH PARQUET
Spark SQL completes more
TPC-DS queries than any other
open source SQL engine for Hadoop
@ 100TB Scale
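To put the Parquet compression figure in concrete terms, the following sketch converts the 60% saving into an on-disk size. It assumes the saving is measured against the nominal 100 TB of raw generated data, which the slide does not state explicitly, and it ignores HDFS replication.

```python
# Assumption: the 60% saving is relative to the nominal raw data size.
RAW_DATA_TB = 100
SPACE_SAVED_PERCENT = 60

# Size remaining after compression, before any HDFS replication.
parquet_tb = RAW_DATA_TB * (100 - SPACE_SAVED_PERCENT) / 100

print(f"Estimated Parquet size: {parquet_tb} TB")
```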
But… is this a good result?
WHAT CAN WE COMPARE IT TO?
Big SQL also runs TPC-DS queries…
The following benchmark results used the same hardware as the Spark SQL F1 cluster, using Big SQL v4.3 Technical Review
Query Compliance Through the Scale Factors
▪ SQL compliance is important because Business Intelligence tools generate standard SQL
Rewriting queries is painful and impacts productivity
Spark SQL
▪ Spark SQL 2.1 can run all 99 TPC-DS queries but only at lower scale factors
▪ Spark SQL Failures @ 100 TB:
12 runtime errors
4 timeouts (> 10 hours)
Big SQL
▪ Big SQL has been successfully executing all 99 queries since Oct 2014
▪ IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
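The failure counts above determine how many queries Spark SQL completed at 100 TB. A small sketch of that arithmetic, assuming the runtime errors and timeouts are disjoint sets of queries:

```python
TOTAL_QUERIES = 99
RUNTIME_ERRORS = 12
TIMEOUTS = 4  # queries that ran longer than 10 hours

# Assumption: no query is counted as both a runtime error and a timeout.
failed = RUNTIME_ERRORS + TIMEOUTS
completed = TOTAL_QUERIES - failed
completion_rate = completed / TOTAL_QUERIES

print(f"Failed:    {failed}")
print(f"Completed: {completed}")
print(f"Completion rate: {completion_rate:.1%}")
```

The resulting count matches the "Spark SQL @ 83 queries" figure quoted in the performance comparison.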
CPU Profile for Big SQL vs. Spark SQL Hadoop-DS @ 100TB, 4 Concurrent Streams
Spark SQL uses almost 3x more system CPU. These are wasted CPU cycles.
Average CPU Utilization: 76.4%
Average CPU Utilization: 88.2%
I/O Profile for Big SQL vs. Spark SQL Hadoop-DS @ 100TB, 4 Concurrent Streams
Spark SQL required 3.6X more reads and 9.5X more writes
Big SQL can drive peak I/O nearly 2X more
Big SQL is 3.2X faster than Spark 2.1 (4 Concurrent Streams)
Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries
And the best part… Big SQL still has
A LOT OF POTENTIAL
Memory Profile for Big SQL vs. Spark SQL Hadoop-DS @ 100TB, 4 Concurrent Streams
Big SQL
▪ Big SQL is only actively using ~1/3 of memory
More memory could be assigned to bufferpools, sort space, etc.
Big SQL could be even faster!
Spark SQL
▪ Spark SQL is doing a better job at utilizing the available memory, but consequently has less room for improvement via tuning
But this is not about Big SQL vs. Spark
BIG SQL + SPARK IS A GREAT COMBINATION
Recommendation: Right Tool for the Right Job
Big SQL: Ideal tool for BI Data Analysts and production workloads
▪ Migrating existing workloads to Hadoop
▪ Security
▪ Many Concurrent Users
▪ Best Performance
Spark SQL: Ideal tool for Data Scientists and discovery
▪ Machine Learning
▪ Simpler SQL
▪ Good Performance
Not Mutually Exclusive. Big SQL & Spark SQL can co-exist in the cluster
Big SQL – The ONLY engine with Deep Integration with Spark
[Diagram: a Big SQL Head Node, four Big SQL Workers and four Spark Executors running on HDFS; the legend marks fast data transfer over shared memory]
Summary: IBM is investing in Big SQL and Spark SQL
▪ Only Big SQL completes all 99 queries with concurrency at 100TB
▪ Big SQL completes the workload:
3.2x faster than Spark SQL
With almost 3x less system CPU
With 11x fewer read ops and 24x fewer write ops
▪ IBM is investing massively in Spark SQL
▪ To learn more:
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
https://www.youtube.com/watch?v=M5zqykmEu9U