performance comparison between apache hive and oracle sql for big data

Performance Comparison Between Apache Hive and Oracle SQL for Big Data

AnalysisPresented by:- Santosh Kumar Dash |M.Tech CSE, Utkal University

ASE| Tata Consultancy Services Limited

8th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2016)

Contents• Introduction• Motivation and Objective• Proposed Methodology• Data Set Description• Experimental Results• Conclusion and Future Scope• References

Introduction

What is Bigdata?

Bigdata Applications

What is ORACLE SQL?

What is Apache Hive?

Motivation• Many researchers have tried to perform Big Data

Analytics using traditional methods which resulted in poor performances due to memory constraints. • Hence, we are motivated to explore the suitability of

Apache Hive as a distributed database for faster retrieval in comparison to popular Oracle SQL approach.

Objective• Performance Analysis of Oracle SQL w.r.t. time• Performance Analysis of Apache Hive w.r.t. time.• Performance Analysis of Mean processing time between

Apache Hive and Oracle SQL.

Proposed MethodologyThe diagram represents the Proposed Model for the experiment.

Data Set Description• Online Video

Characteristics and Transcoding Time • Record Linkage

Comparison Patterns • 3D Road Network (North

Jutland, Denmark)• Rate (Health-Insurance-

Market place)

Data Set Name Rows Columns

Video 168286 11

Record 5749132 12

Road 434874 4

Rate 13,000000 23

Column Names of DatasetsVideo:- YouTube video id, duration, bitrate (total in Kbits), bitrate(video bitrate in Kbits), height(in pixels), width(in pixels), frame rate, estimated framerate, codec, category, and direct video link.

Record:- id_1, id_2, cmp_fname_c1, cmp_fname_c2,cmolname_c1, cmp_lname_c2, cmp_sex, cmp_bd, cmp_bm, cmp_by, cmp_plz and is_match

Column Names of DatasetsRoad:- OSM_ID, LONGITUDE, LATITUDE, and ALTITUDE

Rate:- BusinessYear, StateCode, IssuerId,Source_Name,Version_Num,ImportDat,Issuer_Id2,FederalTIN,RateEffectiveDate,RateExpiraon Date, PlanId , Rating AreaId, Tobacco, Age, Individual Rate, Individual Tobacco Rate, Couple, Primary Subscriber And One Dependent, Primary Subscriber And Two Dependents, Primary Subscriber And Three Or More Dependents, Couple And One Dependent, Couple And Two Dependents, couple and Three Or More Dependents, and RowNumber

Experimental Results

Queries Description StatementsQuery 1 Retrieving Unique column using

DISTINCTRetrieving unique Output

RecordsQuery 2 Retrieving Records from a given

dataset using ORDER BY for general Sorting

Sorting

Query 3 Retrieving Records Using ORDER BY and DESC for Backward Sorting

Sorting Backward

Query 4 Using COUNT and GROUP BY for Retrieving records and their count.

Grouping with Counting

Query 5 Using MAX aggregate function for retrieving MAXIMUM value from a record

Maximum Value

Performance Comparison of Video dataset

Query 1 Query 2 Query 3 Query 4 Query 50

10

20

30

40

50

60

70

80

Hive Oracle SQL

Performance Comparison of Record dataset


10

20

30

40

50

60

70

Hive Oracle SQL

Performance Comparison of Road dataset


5

10

15

20

25

30

Hive Oracle SQL

Performance Comparison of Road dataset


200

400

600

800

1000

1200

1400

1600

Hive Oracle SQL

Mean Processing Time of All Datasets

Video Road Record Rate0

100

200

300

400

500

600

700

800

900

Hive Oracle SQL

Conclusion APACHE HIVE

• In large data sets, Apache hive is very efficient.• Queries with Group By,

Order By and aggregate function are taking more time as compared to a retrieving entire dataset.• average time is more if the

number of rows is less

ORACLE SQL

• In Small scale datasets, Oracle SQL performs better. • Queries with Group By,

Order By and aggregate function are taking less time as compared to a retrieving entire dataset.• average time is less if the

number of rows is less

Future work• In future, we will take large-scale (In TB) datasets and

do analysis on both Apache Hive and Oracle SQL for the performance test.

References1. Chawda RK , Big Data and Advanced Analytics

Tools.Symposium on Colossal Data Analysis and Networking (CDAN).(2016)

2. Garg V, Optimization of Multiple Queries for Big Data with Apache Hadoop/Hive. 2015 Int Conf Comput Intell Commun Networks, pp. 938–941.(2015)

3. Gruenheid A, Omiecinski E, Mark L Query Optimization Using Column Statistics in Hive Categories and Subject Descriptors.(2016)

4. Haryono GP, Zhou Y (2016) Profiling apache HIVE query from runtime logs. Int Conf Big Data Smart Comput BigComp pp. 61–68.(2016)

References5. Kaisler S, Armour F, Espinosa JA, Money W, Big Data:

Issues and Challenges Moving Forward. 2013 46th Hawaii Int Conf Syst Sci.pp. 995–1004. (2013)

6. Rotsnarani Sethy, Mrutyunjaya Panda, Big Data Analysis Using Hadoop: A survey. IJARCSSE pp.1153–1157.(2015)

7. Thusoo A, Sarma J Sen, Jain N, Shao Z, Chakka P, Zhang N, Antony S, Liu H, Murthy R ,Hive - A petabyte scale data warehouse using Hadoop. In: Proc. - Int. Conf. Data Eng. pp 996–1005.(2010)

8. Loshin D, Chapter 7 - Big Data Tools and Techniques. pp 61–72 (2013)

References9. Hive Architecture.

https://cwiki.apache.org/confluence/display/Hive/Design.

10.Introduction to Oracle Database. https://docs.oracle.com/database/121/CNCPT/intro.htm#CNCPT001.

11.Online Video Characteristics and Transcoding Time Dataset Data Set. https://archive.ics.uci.edu/ml/datasets.html.(2015)

12.Record Linkage Comparison Patterns Data Set. https://archive.ics.uci.edu/ml/datasets.html.(2011)

References13.3D Road Network (North Jutland, Denmark) Data Set.

https://archive.ics.uci.edu/ml/datasets.html.(2013) 14.Rate Data Set https://www.kaggle.com/hhsgov/health-

insurance-marketplace.(2015)15.5v’s of Big data http://bigdata.black/featured/what-is-

big-data/16.Big data applications http://image.slidesharecdn.com/

performance comparison between apache hive and oracle sql for big data

Technology