benchmarking hadoop - which hadoop sql engine leads the herd

Which Hadoop SQL Engine Leads the Herd?

Stewart Tate STSM, Chief Designer, IBM Big Data Cluster

Lead Architect BigInsights Performance

IBM Silicon Valley Laboratory ∙ San Jose ∙ California

[email protected]


© 2013 IBM Corporation

Acknowledgements and Disclaimers

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2014. All rights reserved.

U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

• TPC Benchmark, TPC-DS, and QphDS are trademarks of the Transaction Processing Performance Council.

• Cloudera, the Cloudera logo, and Cloudera Impala are trademarks of Cloudera.

• Other company, product, or service names may be trademarks or service marks of others.



Our journey begins…

• Evaluating three SQL engines in major Hadoop Distributions

IBM BigInsights 3.0.0.1 Big SQL v3

Cloudera (CDH5) Impala 1.4

Hortonworks (HDP 2.1) Hive 0.13

• Workload

Simulated Enterprise Retail Operation using 99 Queries

Hadoop-DS based on TPC-DS benchmark


Three Identical Hadoop Clusters…

Management Node

One x3650 M4 BD, two E5-2680 v2 2.8GHz 10-core CPUs

128GB RAM, 1866MHz

2TB 3.5” HDD

Dual-port 10GbE

RHEL 6.4

EXT4/HDFS/Parquet/ORC

Data Nodes

Sixteen x3650 M4 BD, each with two E5-2680 v2 2.8GHz 10-core CPUs

128GB RAM, 1866MHz

Ten 2TB 3.5” HDD

Four 120GB 3.5” SSD

Dual-port 10GbE

RHEL 6.4

EXT4/HDFS/Parquet/ORC

Using Lenovo servers designed for Hadoop
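As a rough sanity check on the data-node specification above, the aggregate resources of one cluster can be tallied from the per-node figures. The sketch below is illustrative arithmetic only: the variable names are invented for this example, and the totals are raw capacities (before HDFS replication or file-system overhead), not numbers quoted in the presentation.

```python
# Per-data-node figures as listed on the slide.
DATA_NODES = 16
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 10
RAM_GB_PER_NODE = 128
HDDS_PER_NODE = 10
HDD_TB = 2
SSDS_PER_NODE = 4
SSD_GB = 120

# Aggregate raw resources across the sixteen data nodes
# (the management node is deliberately excluded).
total_cores = DATA_NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
total_ram_gb = DATA_NODES * RAM_GB_PER_NODE
raw_hdd_tb = DATA_NODES * HDDS_PER_NODE * HDD_TB
raw_ssd_gb = DATA_NODES * SSDS_PER_NODE * SSD_GB

print(f"physical cores: {total_cores}")
print(f"RAM (GB):       {total_ram_gb}")
print(f"raw HDD (TB):   {raw_hdd_tb}")
print(f"raw SSD (GB):   {raw_ssd_gb}")
```

Usable HDFS capacity would be lower than the raw HDD figure, depending on the replication factor and on how much disk is reserved for temporary data; neither is stated on the slide.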


Benchmark Back Office…


Live access to Audited & Real-time results

http://app.insightibm.com


Which SQL Engine Leads the Herd?

View results from each Hadoop Distro via our mobile app

http://app.insightibm.com


Big SQL – Runs 100% of the queries

Key points

• Impala & Hive require many queries to be re-written, some significantly

• Owing to various restrictions, some queries could not be re-written or failed at run-time

• Re-writing queries in a benchmark scenario where results are known is one thing; doing this against real databases in production is another


Hadoop-DS benchmark – Single user performance @ 10TB

Big SQL is 3.6x faster than Impala and 5.4x faster than Hive 0.13 for a single query stream using 46 common queries


Hadoop-DS benchmark – multi-user performance @ 10TB

With 4 query streams using 46 common queries, Big SQL is 2.1x faster than Impala and 8.5x faster than Hive 0.13

**See Speaker notes for disclaimer


Hadoop-DS benchmark – Big SQL with 99 queries @ 30TB

Big SQL completed 4 concurrent query streams @ 30TB in 1.8x the time of a single query stream

Big SQL completed 396 queries in only 1.8x the time of 99 queries
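The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above (99 queries per stream, 4 streams, 1.8x elapsed time); the "throughput gain" it derives is an interpretation added here, not a metric reported in the presentation.

```python
QUERIES_PER_STREAM = 99
STREAMS = 4
# Elapsed time of the 4-stream run relative to the 1-stream run (from the slide).
RELATIVE_ELAPSED = 1.8

# Four streams of 99 queries each.
total_queries = QUERIES_PER_STREAM * STREAMS

# Four times the work in 1.8x the time implies this relative throughput.
throughput_gain = STREAMS / RELATIVE_ELAPSED

print(total_queries)              # 396
print(round(throughput_gain, 2))  # 2.22
```

In other words, running four streams concurrently delivered roughly 2.2 times the query throughput of a single stream, under the assumption that each stream does the same amount of work.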


Thank you!


About TPC-DS Queries

• The queries are diverse, and many are complex

• Reflecting real business needs – a random sample:

Find customers returning items more frequently than normal (q1)

States with customers most amenable to premium-priced offers (q6)

List key metrics for unadvertised in-store promotions by demographic (q7)

Identify similar customers purchasing through multiple sales outlets (q10)

Find customers shifting purchasing habits to the web (q11)

Key measures for catalog sales fulfilled from an alternate warehouse (q16)

Find frequently sold items and the circumstances under which repeat sales take place (q23)

Understand the products and retail locations where items are likely to be returned and subsequently re-purchased via the catalog (q29)

Display customers making significant local purchases compared to buying potential based on dependents and vehicles owned (q34)


Hadoop-DS benchmark

• Aim: To provide the fairest and most meaningful comparison of SQL over Hadoop solutions so far

• Hadoop-DS benchmark is based on the TPC-DS* benchmark. Key deviations:

No data maintenance or data persistence phases

• Not possible across all vendors

Uses a common query set across all solutions

• The sub-set of queries that all vendors can successfully execute at that scale factor

• Queries are not cherry picked

• It is the most complete TPC-DS-like benchmark executed to date

• Worked with a TPC-certified auditor to review the benchmark

• Is analogous to porting a relational workload to SQL on Hadoop

Bringing Order out of Chaos
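The "common query set" rule above amounts to a set intersection: a query is included only if every engine ran it successfully at the given scale factor. The following is a minimal sketch of that idea, not the benchmark's actual tooling. The function name, the engine labels, and the query IDs are all invented for illustration and do not reflect which queries each engine actually completed.

```python
def common_query_set(successes_by_engine):
    """Return the sorted query IDs that every engine completed successfully."""
    sets = [set(ids) for ids in successes_by_engine.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Hypothetical results: which query IDs each engine ran successfully.
example = {
    "engine_a": [1, 2, 3, 4, 5],
    "engine_b": [1, 3, 4],
    "engine_c": [1, 2, 3],
}

common = common_query_set(example)
print(common)  # [1, 3]
```

Because the set is defined by what all engines can run, no single vendor chooses the queries, which is the sense in which the queries are "not cherry picked".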


Big SQL 3.0 – Key Performance Features

• Advanced re-write and cost-based optimizer backed by decades of IBM R&D

• Optimized HDFS readers for different storage formats

Native readers

• Self-tuning memory manager

Optimizes memory allocation between Big SQL consumers depending on the workload

• Informational Constraints

• Advanced Statistics

• Resource sharing and WLM
