using bigbench to compare hive and spark

Using BigBench to compare Hive and Spark

Alejandro Montero, Nicolas Poggi

1

What is BigBench (TPCx-BB1)?

• Specification-based benchmark with an open-source implementation, proposed as the first Big Data benchmark standard.

• BigBench covers all major Big Data characteristics• Volume -> Scale factor.• Velocity -> Table refresh.• Variety -> Data type disparity.

• Extension of TPC-DS.• Borrows 10 pure QL queries from TCP-DS. • Added BigData tables and use cases: Machine Learning, Natural Language Processing, ...

• Can support multiple implementations, multiple BigData engines and table formats.

• Can execute multiple parallel streams.

• Defines scale factors for data.• Tested: 100 GB.

2[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf

http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf

BigBench – Overview

3

Unstructured DataStructured Data

Semi-Structured Data

Marketprice Items

Sales

Web Page Customers

Reviews

Web Log

Workload:

• 14 Pure QL queries.• 10 Borrowed from TPC-DS.

• 4 Queries with MapReduce pre-processing.• 7 Natural Language Processing Queries.• 5 Machine Learning Queries.

BigBench v1.2 – Reference Implementation

HDFS

Hive Metastore

MapReduce Tez Spark

Yarn

Hive Spark SQL

Mahout ML Custom Spark MLlibApplication

SQL Engine

Table Metastore

Execution Engine

Filesystem

Benchmarked systems:

• Hive + MapReduce + Mahout• Hive + MapReduce + Spark_MLlib• Hive + Tez + Mahout• Hive + Tez + Spark_MLlib• Spark SQL + Mahout• Spark SQL + Spark_MLlib• Spark 2 SQL + Mahout

Work in progress:

• Hive 2• Spark 2 SQL + Spark_MLlib

The cluster – HDInsight PaaS

5

Model HDInsight D4v3

# Head nodes 2

# Working nodes 4

# Zookeeper nodes 3

CPU Intel(R) Xeon(R) CPU E5-2673 v38 x 2,4 GHz cores

RAM 28 GB

HDFS Remote

Software HortonWorks Data Platform 2.5

Spark config 1 executor/working node3 cores/executor

Pure QL

6

6223

16011848

1457

0

1000

2000

3000

4000

5000

6000

7000

Hive_MR Hive_tez Spark_1.6.2 Spark_2.0.1

Tim

e in

sec

on

ds

Average of three executions using 100 GB Scale Factor

Query 12 CPU behavior

7

Tez Spark 1.6.2 Spark 2.0.2


Custom Reducers

8

2815

1122

16291466

0

500

1000

1500

2000

2500

3000

Hive_MR Hive_tez Spark_1.6.2 Spark_2.0.1

Tim

e in

sec

on

ds


Natural Language Processing

9

5004

1100

2289

1913

0

1000

2000

3000

4000

5000

6000

Hive_MR Hive_Tez Spark_1.6.2 Spark_2.0.1

Tim

e in

sec

on

ds


Machine Learning

10

3550

1613

3769

1898

4045

1937

3390

0

500

1000

1500

2000

2500

3000

3500

4000

4500

Hive_MR +Mahout

Hive_MR +Spark_ml

Hive_tez +Mahout

Hive_Tez +Spark_ML

Spark_1.6.2 +Mahout

Spark_1.6.2 +Spark_ML

Spark_2.0.1 +Mahout

Tim

e in

sec

on

ds


11

Aggregated Results

6223 6223

1601 1623 1848 1951 1457

2815 2815

1122 11371629 1489

1466

5004 5004

1100 1082

2289 23561913

3550

1613

3769

1898

4045

1937 3390

17592

15655

7592

5740

9811

77338226

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

20000

Hive_MR +Mahout

Hive_MR +Spark_ML

Hive_tez +Mahout

Hive_Tez +Spark_ML

Spark_1.6.2 +Mahout

Spark_1.6.2 +Spark_ML

Spark_2.0.1 +Mahout

Tim

e in

sec

on

ds

Pure-QL Custom Reducers NLP ML Total


Conclusions

• Hive on Tez greatly improves SQL performance over Hive on MapReduce.• It is also faster than Hive on spark 1.

• Hive on spark 2 is slightly faster.

• The Spark implementation is based on hive…

• Spark MLlib has an excellent performance over Mahout.

• Best production combination: Apache Tez for SQL + Spark MLlib forMachine Learning.

12

Thanks, questions?

Follow up / feedback : [email protected]

Using BigBench to compare Hive and Spark

13

using bigbench to compare hive and spark

Data & Analytics