using bigbench to compare hive and spark
TRANSCRIPT
Using BigBench to compare Hive and Spark
Alejandro Montero, Nicolas Poggi
1
What is BigBench (TPCx-BB1)?
• Specification-based benchmark with an open-source implementation, proposed as the first Big Data benchmark standard.
• BigBench covers all major Big Data characteristics• Volume -> Scale factor.• Velocity -> Table refresh.• Variety -> Data type disparity.
• Extension of TPC-DS.• Borrows 10 pure QL queries from TCP-DS. • Added BigData tables and use cases: Machine Learning, Natural Language Processing, ...
• Can support multiple implementations, multiple BigData engines and table formats.
• Can execute multiple parallel streams.
• Defines scale factors for data.• Tested: 100 GB.
2[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf
BigBench – Overview
3
Unstructured DataStructured Data
Semi-Structured Data
Marketprice Items
Sales
Web Page Customers
Reviews
Web Log
Workload:
• 14 Pure QL queries.• 10 Borrowed from TPC-DS.
• 4 Queries with MapReduce pre-processing.• 7 Natural Language Processing Queries.• 5 Machine Learning Queries.
BigBench v1.2 – Reference Implementation
HDFS
Hive Metastore
MapReduce Tez Spark
Yarn
Hive Spark SQL
Mahout ML Custom Spark MLlibApplication
SQL Engine
Table Metastore
Execution Engine
Filesystem
Benchmarked systems:
• Hive + MapReduce + Mahout• Hive + MapReduce + Spark_MLlib• Hive + Tez + Mahout• Hive + Tez + Spark_MLlib• Spark SQL + Mahout• Spark SQL + Spark_MLlib• Spark 2 SQL + Mahout
Work in progress:
• Hive 2• Spark 2 SQL + Spark_MLlib
The cluster – HDInsight PaaS
5
Model HDInsight D4v3
# Head nodes 2
# Working nodes 4
# Zookeeper nodes 3
CPU Intel(R) Xeon(R) CPU E5-2673 v38 x 2,4 GHz cores
RAM 28 GB
HDFS Remote
Software HortonWorks Data Platform 2.5
Spark config 1 executor/working node3 cores/executor
Pure QL
6
6223
16011848
1457
0
1000
2000
3000
4000
5000
6000
7000
Hive_MR Hive_tez Spark_1.6.2 Spark_2.0.1
Tim
e in
sec
on
ds
Average of three executions using 100 GB Scale Factor
Query 12 CPU behavior
7
Tez Spark 1.6.2 Spark 2.0.2
Average of three executions using 100 GB Scale Factor
Custom Reducers
8
2815
1122
16291466
0
500
1000
1500
2000
2500
3000
Hive_MR Hive_tez Spark_1.6.2 Spark_2.0.1
Tim
e in
sec
on
ds
Average of three executions using 100 GB Scale Factor
Natural Language Processing
9
5004
1100
2289
1913
0
1000
2000
3000
4000
5000
6000
Hive_MR Hive_Tez Spark_1.6.2 Spark_2.0.1
Tim
e in
sec
on
ds
Average of three executions using 100 GB Scale Factor
Machine Learning
10
3550
1613
3769
1898
4045
1937
3390
0
500
1000
1500
2000
2500
3000
3500
4000
4500
Hive_MR +Mahout
Hive_MR +Spark_ml
Hive_tez +Mahout
Hive_Tez +Spark_ML
Spark_1.6.2 +Mahout
Spark_1.6.2 +Spark_ML
Spark_2.0.1 +Mahout
Tim
e in
sec
on
ds
Average of three executions using 100 GB Scale Factor
11
Aggregated Results
6223 6223
1601 1623 1848 1951 1457
2815 2815
1122 11371629 1489
1466
5004 5004
1100 1082
2289 23561913
3550
1613
3769
1898
4045
1937 3390
17592
15655
7592
5740
9811
77338226
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
Hive_MR +Mahout
Hive_MR +Spark_ML
Hive_tez +Mahout
Hive_Tez +Spark_ML
Spark_1.6.2 +Mahout
Spark_1.6.2 +Spark_ML
Spark_2.0.1 +Mahout
Tim
e in
sec
on
ds
Pure-QL Custom Reducers NLP ML Total
Average of three executions using 100 GB Scale Factor
Conclusions
• Hive on Tez greatly improves SQL performance over Hive on MapReduce.• It is also faster than Hive on spark 1.
• Hive on spark 2 is slightly faster.
• The Spark implementation is based on hive…
• Spark MLlib has an excellent performance over Mahout.
• Best production combination: Apache Tez for SQL + Spark MLlib forMachine Learning.
12
Thanks, questions?
Follow up / feedback : [email protected]
Using BigBench to compare Hive and Spark
13