CAN THE ELEPHANTS HANDLE THE NOSQL ONSLAUGHT?

by SRIDHAR REDDY VORUGANTI

CSU ID: 2607043

Source: cis.csuohio.edu/~sschung/cis611/CIS611SridharSQLOnSlaught.pdf

ABSTRACT

• Traditional DBMSs under attack.

• NoSQL vs. SQL.

• Result (evaluation).

WHAT WE DISCUSS?

INTRODUCTION…

• The database community is currently at an unprecedented and exciting inflection point.

• RDBMSs are no longer the only viable alternative for data-driven applications.

• At the other end of the big data application spectrum are analytical decision support workloads that are characterized by complex queries on massive amounts of data.

• The results are shown for the sole purpose of providing relative comparisons for this paper, and should not be compared to official benchmark results.

BACKGROUND…

• Parallel Data Warehouse (PDW)

• Hive

• MongoDB

PARALLEL DATA WAREHOUSE (PDW)…

• Parallel database system.

• Two types of nodes: compute and control.

• Data is horizontally partitioned across the compute nodes.

• The Data Movement Service (DMS) shuffles data between nodes.

• The control node handles post-processing and re-integration of results.
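The flow on this slide (horizontal partitioning, shuffling between nodes, and a final merge on the control node) can be sketched in a few lines of Python. This is only an illustration of the general technique, not PDW's implementation: the node count, the modulo placement rule, the toy `orders` rows, and all function names are assumptions made up for the example.

```python
from collections import defaultdict

NUM_COMPUTE_NODES = 4  # hypothetical cluster size for the example


def node_for(key, num_nodes=NUM_COMPUTE_NODES):
    """Map an integer distribution key to a compute node (simple modulo)."""
    return key % num_nodes


def partition(rows, key_field):
    """Horizontally partition rows across compute nodes by key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_field])].append(row)
    return nodes


def shuffle(nodes, key_field):
    """Redistribute rows on a different key, as a data-movement step would."""
    all_rows = [row for rows in nodes.values() for row in rows]
    return partition(all_rows, key_field)


def local_sum(rows, group_field, value_field):
    """Partial aggregate computed independently on one compute node."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_field]] += row[value_field]
    return dict(totals)


def control_node_merge(partials):
    """Combine the partial results from all compute nodes."""
    final = defaultdict(int)
    for partial in partials:
        for group, total in partial.items():
            final[group] += total
    return dict(final)


# Toy data (hypothetical): four orders belonging to three customers.
orders = [
    {"order_id": 1, "cust_id": 10, "amount": 5},
    {"order_id": 2, "cust_id": 11, "amount": 7},
    {"order_id": 3, "cust_id": 10, "amount": 2},
    {"order_id": 4, "cust_id": 12, "amount": 4},
]

by_order = partition(orders, "order_id")     # initial layout
by_customer = shuffle(by_order, "cust_id")   # co-locate rows per customer
partials = [local_sum(rows, "cust_id", "amount") for rows in by_customer.values()]
result = control_node_merge(partials)
```

After the shuffle, all rows for a given customer sit on the same node, so each node can aggregate locally and the control node only has to combine small partial results.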

HIVE…

• Open-source data warehouse.

• Data is stored in HDFS.

• Queries are written in HiveQL.

• Supports multiple data storage formats.

MONGODB…

• Open-source NoSQL database.

• Data is organized into collections of documents.

• No schema is required.

• Supports automatic partitioning (auto-sharding).

• Supports replica sets.
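A minimal sketch of two ideas from this slide: documents in one collection need not share a schema, and documents can be routed to shards by a shard key. It uses plain Python data structures rather than a MongoDB driver. The split points, the choice of `_id` as shard key, and the sample documents are assumptions for illustration, and the range-based routing is a simplification of real auto-sharding.

```python
import bisect

# A "collection" modeled as a list of documents; fields may differ per document.
users = [
    {"_id": 17, "name": "Ana"},
    {"_id": 42, "name": "Bo", "email": "bo@example.com"},
    {"_id": 88, "name": "Cy", "tags": ["admin"]},
]

# Hypothetical chunk boundaries on the shard key (_id):
# shard 0 holds keys below 30, shard 1 holds 30 to 59, shard 2 holds 60 and up.
SPLIT_POINTS = [30, 60]


def shard_for(shard_key):
    """Return the index of the shard whose key range contains shard_key."""
    return bisect.bisect_right(SPLIT_POINTS, shard_key)


def distribute(collection, num_shards=len(SPLIT_POINTS) + 1):
    """Place every document on the shard that owns its key range."""
    shards = [[] for _ in range(num_shards)]
    for doc in collection:
        shards[shard_for(doc["_id"])].append(doc)
    return shards


shards = distribute(users)
```

A lookup by `_id` only needs to contact `shards[shard_for(key)]`, which is the property that lets point reads scale with the number of shards.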

EVALUATION…

• Evaluation of RDBMSs against NoSQL systems.

• We use TPC-H to evaluate Microsoft’s PDW and Hive.

• We compare MongoDB with Microsoft SQL Server using the YCSB benchmark.

HARDWARE CONFIGURATION…

• 16 nodes connected by a 1 Gbit HP ProCurve Ethernet switch.

• Each node has 2.13 GHz processors, 32 GB of main memory, and ten 300 GB SAS 10K RPM hard drives.

• When evaluating PDW and Hive, we used eight disks to store the data.

• For the YCSB experiments, eight nodes were used as servers.

SOFTWARE CONFIGURATION…

• Hive and Hadoop

• PDW

• MongoDB (Mongo-AS)

• Hive 0.7.1 and Hadoop 0.20.203

• RCFile format instead of text files

• JVM heap size set to 2 GB.

• PDW

– Version AU3

– Maximum 24 GB memory.

• MongoDB

– Version 1.8.2

– “Global lock” for writes.

HIVE VS. PDW…

• Workload Description

• Data Layout

• Data Preparation and Load Times

• Experimental Evaluation

DATA LAYOUT…

• Hive: partitions and buckets

• PDW: partitions and replication
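To make the Hive side of this slide concrete, the sketch below groups rows first by a partition column and then by a bucket number derived from a second column. It is a toy model in plain Python, not Hive code: the column names, the bucket count, and the modulo bucketing rule are assumptions chosen for the example.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket; modulo stands in for Hive's hashing."""
    return key % num_buckets


def layout(rows, partition_col, bucket_col):
    """Group rows by partition value, then by bucket number within the partition."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row[partition_col]][bucket_of(row[bucket_col])].append(row)
    return table


# Toy customer rows (hypothetical column names and values).
rows = [
    {"nationkey": 1, "custkey": 4, "name": "a"},
    {"nationkey": 1, "custkey": 5, "name": "b"},
    {"nationkey": 2, "custkey": 8, "name": "c"},
]

table = layout(rows, "nationkey", "custkey")
```

A query that filters on the partition column only has to read one top-level group, and two tables bucketed the same way on the join key can be joined bucket by bucket.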

Data preparation steps

Hive

• Generate the TPC-H dataset

• Create a Hive table for each TPC-H table

• Load data in two phases

– Data loaded to HDFS

– Data converted to RCFile

PDW

• TPC-H data is generated on the landing node

• Specify schema and tables

• Text files are split into multiple chunks

• Chunks are loaded to the nodes
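The PDW steps of splitting text files into chunks and loading the chunks to nodes can be sketched as follows. The chunk size, the node count, the pipe-delimited sample lines, and the round-robin assignment are all assumptions for illustration; the slides do not say how chunks are assigned to nodes.

```python
def split_into_chunks(lines, chunk_size):
    """Split a list of text lines into consecutive chunks of at most chunk_size."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def assign_round_robin(chunks, num_nodes):
    """Hand chunks to nodes in turn (an assumed policy, chosen for simplicity)."""
    assignment = {node: [] for node in range(num_nodes)}
    for index, chunk in enumerate(chunks):
        assignment[index % num_nodes].append(chunk)
    return assignment


# Ten pipe-delimited lines standing in for a generated text file.
lines = [f"{i}|row-{i}" for i in range(10)]

chunks = split_into_chunks(lines, chunk_size=3)
assignment = assign_round_robin(chunks, num_nodes=2)
```

Because the chunks are independent, each node can load its share in parallel with the others.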

EXPERIMENTAL EVALUATION…

QUERIES…

• Performance Analysis

– Query 5

– Query 19

• Scalability Analysis

– Query 1

– Query 22

• Query 5 (joins customer, orders, lineitem, supplier, nation, and region)

• Query 19 (joins lineitem and part)

• Query 1

– Scans the lineitem table

• Query 22

– Scans customer table

– 4 sub-queries

MONGODB VS. SQL SERVER…

• Workload Description

– YCSB benchmark

– Read heavy and Read only

• Experimental Evaluation

– YCSB benchmark

– Update heavy, Read latest and Short ranges
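The five workload names above match the standard YCSB core workloads, whose operation mixes can be written down as a small table and sampled from. The sketch below assumes the deck used those standard proportions. The generator function, its name, and the seed are made up for the example, and it does not model key distributions or record sizes.

```python
import random

# Operation mixes of the standard YCSB core workloads
# (assumed here to correspond to the workload names on the slide).
WORKLOADS = {
    "update heavy": {"read": 0.50, "update": 0.50},
    "read heavy": {"read": 0.95, "update": 0.05},
    "read only": {"read": 1.00},
    "read latest": {"read": 0.95, "insert": 0.05},
    "short ranges": {"scan": 0.95, "insert": 0.05},
}


def generate_ops(workload, count, seed=0):
    """Draw `count` operation names according to the workload's mix."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    mix = WORKLOADS[workload]
    return rng.choices(list(mix), weights=list(mix.values()), k=count)


ops = generate_ops("read heavy", 1000)
read_fraction = ops.count("read") / len(ops)
```

A client driver would replace each operation name with a real call against the system under test and record its latency.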

YCSB BENCHMARK…

CONCLUSIONS AND FUTURE WORK…

• Hive and MongoDB are popular NoSQL alternatives to relational systems.

• We compared them with PDW and SQL Server using the TPC-H benchmark and the YCSB benchmark.

• Our results find that the relational systems continue to provide a significant performance advantage over their NoSQL counterparts, but the NoSQL alternatives are competitive in some cases.

• Future work: expand the study to more SQL and NoSQL systems and revisit the performance differences in a few years.

