performance comparison between apache hive and oracle sql for big data
TRANSCRIPT
Performance Comparison Between Apache Hive and Oracle SQL for Big Data
AnalysisPresented by:- Santosh Kumar Dash |M.Tech CSE, Utkal University
ASE| Tata Consultancy Services Limited
8th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2016)
Contents• Introduction• Motivation and Objective• Proposed Methodology• Data Set Description• Experimental Results• Conclusion and Future Scope• References
Introduction
What is Bigdata?
Bigdata Applications
What is ORACLE SQL?
What is Apache Hive?
Motivation• Many researchers have tried to perform Big Data
Analytics using traditional methods which resulted in poor performances due to memory constraints. • Hence, we are motivated to explore the suitability of
Apache Hive as a distributed database for faster retrieval in comparison to popular Oracle SQL approach.
Objective• Performance Analysis of Oracle SQL w.r.t. time• Performance Analysis of Apache Hive w.r.t. time.• Performance Analysis of Mean processing time between
Apache Hive and Oracle SQL.
Proposed MethodologyThe diagram represents the Proposed Model for the experiment.
Data Set Description• Online Video
Characteristics and Transcoding Time • Record Linkage
Comparison Patterns • 3D Road Network (North
Jutland, Denmark)• Rate (Health-Insurance-
Market place)
Data Set Name Rows Columns
Video 168286 11
Record 5749132 12
Road 434874 4
Rate 13,000000 23
Column Names of DatasetsVideo:- YouTube video id, duration, bitrate (total in Kbits), bitrate(video bitrate in Kbits), height(in pixels), width(in pixels), frame rate, estimated framerate, codec, category, and direct video link.
Record:- id_1, id_2, cmp_fname_c1, cmp_fname_c2,cmolname_c1, cmp_lname_c2, cmp_sex, cmp_bd, cmp_bm, cmp_by, cmp_plz and is_match
Column Names of DatasetsRoad:- OSM_ID, LONGITUDE, LATITUDE, and ALTITUDE
Rate:- BusinessYear, StateCode, IssuerId,Source_Name,Version_Num,ImportDat,Issuer_Id2,FederalTIN,RateEffectiveDate,RateExpiraon Date, PlanId , Rating AreaId, Tobacco, Age, Individual Rate, Individual Tobacco Rate, Couple, Primary Subscriber And One Dependent, Primary Subscriber And Two Dependents, Primary Subscriber And Three Or More Dependents, Couple And One Dependent, Couple And Two Dependents, couple and Three Or More Dependents, and RowNumber
Experimental Results
Queries Description StatementsQuery 1 Retrieving Unique column using
DISTINCTRetrieving unique Output
RecordsQuery 2 Retrieving Records from a given
dataset using ORDER BY for general Sorting
Sorting
Query 3 Retrieving Records Using ORDER BY and DESC for Backward Sorting
Sorting Backward
Query 4 Using COUNT and GROUP BY for Retrieving records and their count.
Grouping with Counting
Query 5 Using MAX aggregate function for retrieving MAXIMUM value from a record
Maximum Value
Performance Comparison of Video dataset
Query 1 Query 2 Query 3 Query 4 Query 50
10
20
30
40
50
60
70
80
Hive Oracle SQL
Performance Comparison of Record dataset
Query 1 Query 2 Query 3 Query 4 Query 50
10
20
30
40
50
60
70
Hive Oracle SQL
Performance Comparison of Road dataset
Query 1 Query 2 Query 3 Query 4 Query 50
5
10
15
20
25
30
Hive Oracle SQL
Performance Comparison of Road dataset
Query 1 Query 2 Query 3 Query 4 Query 50
200
400
600
800
1000
1200
1400
1600
Hive Oracle SQL
Mean Processing Time of All Datasets
Video Road Record Rate0
100
200
300
400
500
600
700
800
900
Hive Oracle SQL
Conclusion APACHE HIVE
• In large data sets, Apache hive is very efficient.• Queries with Group By,
Order By and aggregate function are taking more time as compared to a retrieving entire dataset.• average time is more if the
number of rows is less
ORACLE SQL
• In Small scale datasets, Oracle SQL performs better. • Queries with Group By,
Order By and aggregate function are taking less time as compared to a retrieving entire dataset.• average time is less if the
number of rows is less
Future work• In future, we will take large-scale (In TB) datasets and
do analysis on both Apache Hive and Oracle SQL for the performance test.
References1. Chawda RK , Big Data and Advanced Analytics
Tools.Symposium on Colossal Data Analysis and Networking (CDAN).(2016)
2. Garg V, Optimization of Multiple Queries for Big Data with Apache Hadoop/Hive. 2015 Int Conf Comput Intell Commun Networks, pp. 938–941.(2015)
3. Gruenheid A, Omiecinski E, Mark L Query Optimization Using Column Statistics in Hive Categories and Subject Descriptors.(2016)
4. Haryono GP, Zhou Y (2016) Profiling apache HIVE query from runtime logs. Int Conf Big Data Smart Comput BigComp pp. 61–68.(2016)
References5. Kaisler S, Armour F, Espinosa JA, Money W, Big Data:
Issues and Challenges Moving Forward. 2013 46th Hawaii Int Conf Syst Sci.pp. 995–1004. (2013)
6. Rotsnarani Sethy, Mrutyunjaya Panda, Big Data Analysis Using Hadoop: A survey. IJARCSSE pp.1153–1157.(2015)
7. Thusoo A, Sarma J Sen, Jain N, Shao Z, Chakka P, Zhang N, Antony S, Liu H, Murthy R ,Hive - A petabyte scale data warehouse using Hadoop. In: Proc. - Int. Conf. Data Eng. pp 996–1005.(2010)
8. Loshin D, Chapter 7 - Big Data Tools and Techniques. pp 61–72 (2013)
References9. Hive Architecture.
https://cwiki.apache.org/confluence/display/Hive/Design.
10.Introduction to Oracle Database. https://docs.oracle.com/database/121/CNCPT/intro.htm#CNCPT001.
11.Online Video Characteristics and Transcoding Time Dataset Data Set. https://archive.ics.uci.edu/ml/datasets.html.(2015)
12.Record Linkage Comparison Patterns Data Set. https://archive.ics.uci.edu/ml/datasets.html.(2011)
References13.3D Road Network (North Jutland, Denmark) Data Set.
https://archive.ics.uci.edu/ml/datasets.html.(2013) 14.Rate Data Set https://www.kaggle.com/hhsgov/health-
insurance-marketplace.(2015)15.5v’s of Big data http://bigdata.black/featured/what-is-
big-data/16.Big data applications http://image.slidesharecdn.com/