SQL on Hadoop 100TB TPC-DS Benchmark

© 2017 IBM Corporation

A Performance Study:

SQL-on-Hadoop with TPC-DS queries (Hadoop-DS)

Analytics Performance


Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2017. All rights reserved.

— U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at

▪ “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

▪ TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council

▪ Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.

▪ Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.

▪ Other company, product, or service names may be trademarks or service marks of others.


What is TPC-DS?

▪ TPC = Transaction Processing Performance Council

• Non-profit corporation (vendor independent)

• Defines various industry-driven database benchmarks

▪ DS = Decision Support

• Models a multi-domain data warehouse environment for a hypothetical retailer

• Retail Sales, Web Sales, Inventory, Demographics, Promotions

• Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB

▪ 99 Pre-Defined Queries

• Query Classes: Reporting, Ad Hoc, Iterative OLAP, Data Mining


What’s the Best Solution for ANALYTICAL SQL ON HADOOP?


Thinking of moving BI workloads from Data Warehouse to Hadoop?

Radiant Advisors: Sponsored by Teradata (Q2 2016)

Presto, Impala, Hive and Spark SQL (pre-2.0)


Latest Benchmarks Direct from Cloudera / Hortonworks SQL are not much better.

Publisher, Date: Product (claim); TPC-DS Queries; Data Vol

▪ Cloudera, Sept 2016: Impala 2.6 on AWS (claims 42% more performant than AWS Redshift); 70 query subset; 3TB

▪ Cloudera, August 2016: Impala 2.6 (claims 22% faster for TPC-DS than previous version); 17 queries referenced; data volume not specified

▪ Cloudera, April 2016: Impala 2.5 (claims 4.3x faster for TPC-DS than previous version); 24 query subset; 15TB *1

▪ Hortonworks, July 2016: Hive 2.1 with LLAP (claims 25x faster for TPC-DS than Hive 1.2); 15 query subset; 1TB

https://blog.cloudera.com/blog/2016/09/apache-impala-incubating-vs-amazon-redshift-s3-integration-elasticity-agility-and-cost-performance-benefits-on-aws/

http://blog.cloudera.com/blog/2016/08/bi-and-sql-analytics-with-apache-impala-incubating-in-cdh-5-8-3x-faster-on-secure-clusters/

https://blog.cloudera.com/blog/2016/04/apache-impala-incubating-in-cdh-5-7-4x-faster-for-bi-workloads-on-apache-hadoop/

http://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/


But there is good news…

SPARK RUNS ALL 99 QUERIES


IBM Leadership in Spark SQL and ML

Major focus areas include Spark SQL and ML

https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12326761

Statistics as of February 1, 2017


IBM Shared Experiences running 99 TPC-DS queries (Oct 2016) @ Spark Summit Brussels

10 TB Scale Factor


Spark 2.1 shows continued improvement…

WHAT WOULD IT TAKE TO RUN 100 TB

IBM delivers the most complete benchmark by any vendor for SQL on Hadoop with 10X more data


100TB TPC-DS is BIG data


Benchmark Environment: IBM “F1” Spark SQL Cluster

▪ 28 Nodes Total (Lenovo x3640 M5)

▪ Each configured as:

• 2 sockets (18 cores/socket)

• 1.5 TB RAM

• 8x 2TB SSD

▪ 2 Racks

• 20x 2U servers per rack (42U racks)

▪ 1 Switch, 100GbE, 32 ports Mellanox SN2700
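As a rough sanity check on the scale of this cluster, the per-node figures above can be multiplied out. The sketch below assumes all 28 nodes are configured identically, as the slide states, and uses decimal units; the constant and function names are illustrative only.

```python
# Per-node figures taken from the slide above.
NODES = 28
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 18
RAM_TB_PER_NODE = 1.5
SSDS_PER_NODE = 8
SSD_TB_EACH = 2


def cluster_totals():
    """Return aggregate cores, RAM (TB) and raw SSD capacity (TB)."""
    cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
    ram_tb = NODES * RAM_TB_PER_NODE
    ssd_tb = NODES * SSDS_PER_NODE * SSD_TB_EACH
    return {"cores": cores, "ram_tb": ram_tb, "ssd_tb": ssd_tb}


totals = cluster_totals()
print(totals)  # {'cores': 1008, 'ram_tb': 42.0, 'ssd_tb': 448}
```

Under these assumptions the cluster has 1,008 physical cores, 42 TB of RAM and 448 TB of raw SSD, so the 100 TB raw data set is roughly 2.4 times larger than total memory.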


SPARK SQL 2.1 HADOOP-DS @ 100TB: AT A GLANCE

PERFORMANCE

WORKING QUERIES

COMPRESSION: 60% SPACE SAVED WITH PARQUET

Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop @ 100TB Scale
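To put the 60% figure in concrete terms, the short calculation below converts a space saving into an on-disk footprint. It assumes the saving is measured against the 100TB raw data set, which the slide does not state explicitly; the function name is illustrative.

```python
def compressed_size_tb(raw_tb, space_saved_pct):
    """On-disk size after compression, given the percentage of space saved."""
    if not 0 <= space_saved_pct <= 100:
        raise ValueError("space_saved_pct must be between 0 and 100")
    # Work in percentage points first, then divide once at the end.
    return raw_tb * (100 - space_saved_pct) / 100


# Assumption: the 60% saving is relative to the 100 TB raw data set.
print(compressed_size_tb(100, 60))  # 40.0
```

On that assumption, the Parquet tables would occupy about 40 TB before any HDFS replication is taken into account.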


But… is this a good result?

WHAT CAN WE COMPARE IT TO?


Big SQL also runs TPC-DS queries…

The following benchmark results were obtained on the same hardware as the Spark SQL F1 cluster, using Big SQL v4.3 Technical Review


Query Compliance Through the Scale Factors

▪ SQL compliance is important because Business Intelligence tools generate standard SQL

• Rewriting queries is painful and impacts productivity

Spark SQL

▪ Spark SQL 2.1 can run all 99 TPC-DS queries but only at lower scale factors

▪ Spark SQL Failures @ 100 TB:

• 12 runtime errors

• 4 timeouts (> 10 hours)

Big SQL

▪ Big SQL has been successfully executing all 99 queries since Oct 2014

▪ IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
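The failure counts on this slide account for the "83 queries" figure quoted for Spark SQL later in the deck. A minimal sketch of that arithmetic follows; the helper name is hypothetical, and it assumes no query is counted as both a runtime error and a timeout.

```python
TOTAL_QUERIES = 99
SPARK_RUNTIME_ERRORS = 12
SPARK_TIMEOUTS = 4  # queries that exceeded the 10-hour limit


def completed(total, *failure_counts):
    """Queries that finished = total minus every failure category."""
    return total - sum(failure_counts)


spark_ok = completed(TOTAL_QUERIES, SPARK_RUNTIME_ERRORS, SPARK_TIMEOUTS)
big_sql_ok = completed(TOTAL_QUERIES)  # the slide reports no Big SQL failures

print(spark_ok, big_sql_ok)               # 83 99
print(f"{spark_ok / TOTAL_QUERIES:.1%}")  # 83.8%
```

So at 100TB Spark SQL 2.1 completed 83 of the 99 queries (about 84%), while Big SQL completed all 99.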


CPU Profile for Big SQL vs. Spark SQL Hadoop-DS @ 100TB, 4 Concurrent Streams

Spark SQL uses almost 3x more system CPU. These are wasted CPU cycles.

Average CPU Utilization: 76.4%

Average CPU Utilization: 88.2%


I/O Profile for Big SQL vs. Spark SQL Hadoop-DS @ 100TB, 4 Concurrent Streams

Spark SQL required 3.6X more reads and 9.5X more writes

Big SQL can drive peak I/O nearly 2X more


Big SQL is 3.2X faster than Spark 2.1 (4 Concurrent Streams)

Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries


And the best part… Big SQL still has

A LOT OF POTENTIAL


Memory Profile for Big SQL vs. Spark SQL Hadoop-DS @ 100TB, 4 Concurrent Streams

▪ Big SQL is only actively using about 1/3 of memory

• More memory could be assigned to bufferpools, sort space, etc.

• Big SQL could be even faster!

▪ Spark SQL is doing a better job at utilizing the available memory, but consequently has less room for improvement via tuning

[Charts: Big SQL, Spark SQL]


But this is not about Big SQL vs. Spark

BIG SQL + SPARK IS A GREAT COMBINATION


Recommendation: Right Tool for the Right Job

Big SQL

▪ Migrating existing workloads to Hadoop

▪ Security

▪ Many Concurrent Users

▪ Best Performance

Ideal tool for BI Data Analysts and production workloads

Spark SQL

▪ Machine Learning

▪ Simpler SQL

▪ Good Performance

Ideal tool for Data Scientists and discovery

Not Mutually Exclusive. Big SQL & Spark SQL can co-exist in the cluster


Big SQL – The ONLY engine with Deep Integration with Spark

[Diagram: a Big SQL Head Node, four Big SQL Workers and four Spark Executors on top of HDFS. Legend: fast data transfer over shared memory]


Summary: IBM is investing in Big SQL and Spark SQL

▪ Only Big SQL completes all 99 queries with concurrency at 100TB

▪ Big SQL completes the workload:

• 3.2x faster than Spark SQL

• With less than 3x the CPU resources

• With 11x fewer read ops and 24x fewer write ops

▪ IBM is investing massively in Spark SQL

▪ To learn more:

https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/

https://www.youtube.com/watch?v=M5zqykmEu9U
