BACHELOR THESIS
Evaluating Presto as an SQL on Hadoop Solution
A Case at Truecaller

Sahir Ahmed
2016

Bachelor of Arts, Systems Science
Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
Sahir Ahmed, 2016-06-15
D0032N – Bachelor Thesis

Evaluating Presto as an SQL on Hadoop solution
a case at truecaller

Sahir Ahmed – VT2016
Abstract

Truecaller is a mobile application with over 200 million unique users worldwide. Every day Truecaller stores over 1 billion rows of data that are analysed to improve the product. The data is stored in Hadoop, a framework for storing and analysing large amounts of data on a distributed file system. In order to be able to analyse these large amounts of data, the analytics team needs a new solution for more lightweight, ad-hoc analysis. This thesis evaluates the performance of the query engine Presto to see whether it meets the requirements to help the data analytics team at Truecaller gain efficiency. Using a design-science methodology, Presto's pros and cons are presented. Presto is recommended as a solution to be used alongside today's tools for specific lightweight use cases, by users who are familiar with the datasets used by the analytics team. Other solutions are also recommended for future evaluation before a final decision is taken.
Keywords: Hadoop, Big Data, Presto, Hive, SQL on Hadoop
Table of contents

Abstract
Table of contents
Table of figures
Table of tables
Glossary
1. Introduction
   1.1 Problem statement
       1.1.1 Problem description
       1.1.2 Research questions
       1.1.3 Purpose
   1.2 Motivations for research
   1.3 Delimitations
2. Background
   2.1 Truecaller
   2.2 Big Data
       2.2.1 What is Big Data?
       2.2.2 Data Analytics
       2.2.3 Data Warehouse
   2.3 Hadoop
       2.3.1 What is Hadoop
       2.3.2 Truecaller and Hadoop
       2.3.3 MapReduce
       2.3.4 HDFS
       2.3.5 YARN
           2.3.5.1 Slider
       2.3.6 SQL on Hadoop
           2.3.6.1 Hive
           2.3.6.2 Presto
3. Literature review
   3.1 Purpose of literature review
   3.2 Benchmarks
       3.2.1 Fast-data-hackathon
       3.2.2 Renmin University
       3.2.3 Commercial benchmarks
           3.2.3.1 Pivotal (HAWQ)
           3.2.3.2 Cloudera (Impala)
   3.3 Hive not meant for low-latency querying
4. Research methodology
   4.1 Identify problems & Motivate
   4.2 Define objectives for solution
   4.3 Design & Development
   4.4 Demonstration
   4.5 Evaluation
   4.6 Communication
5. Result
   5.1 Cluster details
   5.2 Implementation
   5.3 Use cases and latency in seconds
   5.4 Result table
6. Discussion
   6.1 Presto at truecaller
   6.2 Use cases
   6.3 Limitations
   6.4 Commercial alternatives
   6.5 Further research
7. Conclusion
8. References
   8.1 Books and articles
   8.2 Websites and blog posts
9. Appendices
   9.1 Appendix A – Presto installation
   9.2 Appendix B – Use case queries
Table of figures

Fig 1. Exponential growth of data (Fishman, 2014, July 14)
Fig 2. The Big Data process and sub-processes (Gandomi & Haider, 2015)
Fig 3. Star schema (Connolly & Begg, 2005)
Fig 4. Simplified MapReduce dataflow (White, 2010)
Fig 5. MapReduce flow with one reducer (White, 2010)
Fig 6. MapReduce dataflow with multiple reducers (White, 2010)
Fig 7. HDFS architecture (Srinivasa & Mupalla, 2015)
Fig 8. Applications running on YARN (White, 2010)
Fig 9. YARN components (Apache Hadoop, 2016)
Fig 10. Hive architecture (Thusoo et al., 2009)
Fig 11. Presto architecture (Teradata, 2016)
Fig 12. Fast-data-hackathon results (Treasure Data Blog, 2015, March 20)
Fig 13. Renmin University results for 100-node cluster (Chen et al., 2014)
Fig 14. HAWQ's speedup ratio vs. Impala (Soliman et al., 2014)
Fig 15. Cloudera's Impala benchmark (Cloudera, 2016, February 11)
Fig 16. Design Science Research Methodology (Peffers, Tuunanen, Rothenberger & Chatterjee, 2007)
Fig 17. Query execution time in seconds
Table of tables

Table 1. Glossary
Table 2. What is big data? (Cervone, 2015)
Table 3. RDBMS compared to MapReduce (White, 2010)
Table 4. Requirements
Table 5. Cluster details
Glossary

Hadoop – Open source Java framework for distributed data storage
HDFS – Hadoop Distributed File System
YARN – (Yet Another Resource Negotiator) resource manager for Hadoop
RDBMS – Relational database management system
Node – Commodity computer (hardware) used by HDFS
Cluster – A group of connected nodes used by HDFS
Hive – SQL data warehouse for Hadoop
MapReduce – Programming model for processing data on Hadoop
Presto – SQL on top of Hadoop solution
Slider – Application to run non-YARN applications on YARN

Table 1. Glossary
1. Introduction

This section introduces the reader to Truecaller, the company where this study is conducted, and gives some background about the current situation regarding the data that is stored and analysed. Following that, the data warehouse and the tools used for data analytics within the analytics team at Truecaller are described. Hadoop is one of the main tools used and is described at a fairly high level.

Truecaller is a mobile app used by over 200 million users worldwide. At Truecaller today, the daily incoming data (about half a billion rows) is stored in a distributed file system using the open source Java framework Hadoop. The collected data is used by all departments of the company for improving the product and making new decisions. In order to analyse the large amounts of data, the data-analytics team is looking for a new low-latency solution that can be used as a complement to their standard Hadoop tools, which in some cases are considered too slow. The goal of this research is to identify, test and evaluate a number of tools that fit the team's requirements and could potentially be implemented.
1.1 Problem statement

"Running ad-hoc queries on Hive is too slow for data discovery" – Björn Brinne, Head of Data Analytics at Truecaller
1.1.1 Problem description

The problem within the data-analytics organisation at Truecaller is that Hive is not fast enough for ad-hoc querying and on-the-fly data discovery. The users run their queries in Hive, and longer queries with multiple joins and aggregations can take up to several hours to execute. After the query has run, the user might not be satisfied with the outcome and will have to adjust and rerun the query, sometimes in multiple iterations. This makes the whole process ineffective and, because of the poor performance of the tools, holds back the data scientists and other users from using their creativity to add more value to their job and company. It also limits them to following what is prioritized at the time, which leads to some work not being done or being pushed far back in priority. This can also make the analytics organisation less innovative.
1.1.2 Research questions

1. How much faster than Hive is Presto on Truecaller's Hadoop cluster?
2. Which types of use cases is Presto suitable/not suitable for?
1.1.3 Purpose

The purpose of this project is to identify, test and implement a solution to support data discovery and ad-hoc querying of data in the Truecaller data warehouse. The solution is meant to complement Hadoop and will hopefully be implemented and used by the analytics team at Truecaller. There are several commercial and open source products that are supposed to provide that functionality to the users. Those products often vary in performance and depend on the environment they are set up in. Therefore, the aim of this research is only to solve the issues and use cases at Truecaller; it is not intended as a general recommendation or solution for other companies to rely upon. However, the outcomes can provide readers with an overview that can be compared to other articles on the subject. The main goal is to advise Truecaller on which of the identified solutions to implement and use as an analytical tool.
1.2 Motivations for research

The main motivation is to test and evaluate newer techniques within the growing Big Data field and to compare them with the industry-standard tool Hive. A secondary motivation is to provide support and help to the data-analytics team at Truecaller, which is undergoing rapid growth and wishes to find a more suitable tool for faster querying and analysis of its daily incoming data.
1.3 Delimitations

The project will take place half-time over 4 months and will be based on pre-specified use cases for Truecaller. One solution is to be tested: first in a test environment, e.g. a virtual one-node cluster, and in a second stage in the production environment. The solution is free open source, to keep the costs down. This project will mostly focus on how fast the data is presented and will not go in depth on how the techniques work behind the scenes, since the subject is Information Systems Science. The datasets will be predefined, and the aim is to finally use the data warehouse and daily incoming data.
2. Background
2.1 Truecaller

Truecaller is a Swedish mobile application company based in Stockholm. Their flagship application, Truecaller, is developed to provide users with information about wanted and unwanted calls from unknown numbers. By doing that, Truecaller helps users identify spam calls and messages, giving them the possibility to block those numbers and also to share this information with the rest of the user base (Truecaller, 2016).
Truecaller today has over 200 million users and is steadily growing. The largest user group is located in India, where there are more than 100 million users. In India alone, Truecaller helps identify about 900 million calls and 120 million spam calls every month. The company stores over 2 billion names and numbers in its data warehouse, where the data is later analysed for further actions. Working together with several telecom providers helps Truecaller reach new users in different countries (Choudury, 2015, October 12).
In India, where Truecaller is very popular, the number of unwanted/spam calls is a big problem. The government has special regulations concerning spam calls (Rukmini, 2014, December 13), and this can be one factor in the popularity of the Truecaller application.
Truecaller stores over 1 billion daily events in its data warehouse, where this data is later used for analysis. These analyses can be based on user data, application performance, A/B testing etc. and are used for further development of the application.
2.2 Big Data

"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days." – Eric Schmidt, Software Engineer and Executive Chairman of Google (Walker, 2015)
2.2.1 What is Big Data?

The definition of Big Data is often discussed and is a popular and recurring subject within the IT world. Some choose to define "what" Big Data is, while others choose to look at what it "does" (Gandomi & Haider, 2015). The definition is therefore not crystal clear, and the use of "the three V:s" (Volume, Variety and Velocity) has emerged to describe Big Data.

Volume has to do with size and is often the first thing that comes to mind when talking about Big Data (see fig. 2). When talking about sizes in Big Data, sizes over multiple petabytes often come to mind (Walker, 2015). Keep in mind that these sizes increase all the time, even faster than Moore's Law would suggest (a doubling every two years): the amount of data actually doubles about every 18 months (National Institute of Standards and Technology, 2015).

Variety concerns the varying types of data. The majority (over 90%) of the data is unstructured, because of the different types of data that are uploaded all the time through different channels, such as photos, videos, tweets etc. Structured data is stored in tables in relational databases; this is not the case for data stored in Hadoop.

Velocity refers to the speed at which the data is produced, stored and analysed (see fig. 1). In recent years, as smartphones have become widely used, data velocity has exploded and real-time analytics are in high demand. Using data from end users' smartphones, vendors can analyse data collected from the handheld devices, such as geospatial location, phone numbers, demographics etc. (Gandomi & Haider, 2015). These are the three "main" V:s, but over time additional V:s, such as Variability, Value, Validity, Veracity etc., have entered the discussion in defining Big Data and data analytics (National Institute of Standards and Technology, 2015).
Fig 1. Exponential growth of data (Fishman, 2014, July 14).
Table 2. What is big data? (Cervone, 2015).

Here is an example of one of many definitions of Big Data: "Big Data consists of extensive datasets - primarily in the characteristics of volume, variety, velocity, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis." (National Institute of Standards and Technology, 2015)
So what "does" Big Data do, when is it used, and by whom? We take a look at that in the next paragraph.
2.2.2 Data Analytics

Big Data is useless if it is just stored and never used. The main reason for storing all the data is to analyse it and somehow make use of it. With the help of data analysis, companies can make important decisions about their processes (Walker, 2015). The ways of analysing data are not so different from the past, using statistical models and formulas. The difference is mostly about the size of the datasets and the speed of analysing larger amounts of data. Also, with the help of tools such as Hadoop, companies can focus more on causation because of the availability of the data that can be provided. The use of causation helps us determine why something is happening, which leads to making better decisions (National Institute of Standards and Technology, 2015). According to Lambrinis and Jagadish (2012) (Walker, 2015), the process for extracting insights from data is divided into five stages that belong to two sub-processes, data management and analytics (see fig. 2). The first three stages are within the more back-end data-management sub-process and cover acquiring and storing the data, cleaning and preparing it, and then presenting/delivering it to the analysis sub-process, where the data is actually analysed, interpreted and visualized (Gandomi & Haider, 2015).
Fig 2. The Big data process and sub processes (Gandomi & Haider, 2015).
2.2.3 Data Warehouse

A data warehouse is a dataset/database with data collected from one or multiple data sources, used to analyse historical data to help organisations with decision-making. In 1990 Bill Inmon introduced the term data warehouse, and in 1993 he stated that a data warehouse is:
- Subject-oriented: the data focuses on specific subjects regarding the business, e.g. customers, products etc.
- Integrated: because the data often comes from multiple sources, it has to be restructured to hold a consistent pattern across the data warehouse.
- Time-variant: the data is fetched at different times and dates and has to be presented so that it makes sense for different time intervals and time-related matters.
- Non-volatile: the data grows incrementally and does not replace the old data but grows steadily over time (Connolly & Begg, 2005).
Data warehouses are often connected to OLAP (Online Analytical Processing) tools for analysis, mining and machine learning. The process by which a data warehouse retrieves new data from its data sources is called ETL (extract, transform, load). Data warehouses are often modelled using a star schema, which differs from a normal relational database model (Jukić, Sharma, Nestorov & Jukić, 2015). A star schema is a multidimensional model (see fig. 3) using fact tables that represent a business object, e.g. a purchase. The fact tables are connected to multiple dimension tables with specific grains that together make up the facts in the fact tables, e.g. a product dimension, a time dimension etc. Together they can show what product was purchased at what time and date etc. Data warehouses can be modelled with other schemas too, e.g. the snowflake schema, OLAP cube etc. (Connolly & Begg, 2005).
Fig 3. Star schema (Connolly & Begg, 2005).
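The fact-table/dimension-table split described above can be illustrated with a small relational example. The following sketch uses hypothetical table and column names (not taken from the thesis) and SQLite purely for illustration; Truecaller's warehouse runs on Hadoop/Hive, not SQLite. It builds a minimal star schema with one fact table and two dimensions, then runs the typical star-schema query: join the fact to its dimensions and aggregate.

```python
import sqlite3

# In-memory database with a minimal star schema: one fact table
# (fact_purchase) and two dimension tables (dim_product, dim_time).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_purchase (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'phone'), (2, 'case');
INSERT INTO dim_time    VALUES (10, '2016-06-14'), (11, '2016-06-15');
INSERT INTO fact_purchase VALUES (1, 10, 499.0), (2, 10, 19.0), (1, 11, 479.0);
""")

# A typical star-schema query: join the fact table to its dimensions
# and aggregate, answering "what was sold on which day, for how much?"
rows = con.execute("""
    SELECT t.day, p.name, SUM(f.amount) AS total
    FROM fact_purchase f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time    t ON f.time_id    = t.time_id
    GROUP BY t.day, p.name
    ORDER BY t.day, p.name
""").fetchall()
print(rows)
```

The grain of each dimension (here one row per product, one row per day) determines what questions the fact table can answer; the query shape is the same whether the engine is SQLite, Hive or Presto.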
2.3 Hadoop

2.3.1 What is Hadoop

Hadoop was created by Doug Cutting between 2003 and 2006. Cutting was working on an open source search engine called Nutch, where he was inspired by Google's technologies GFS (Google File System) and MapReduce, which together made up a platform for processing data efficiently on a large scale (White, 2010). From those technologies Cutting started developing Hadoop, which quickly became a prioritized project within the Apache open source foundation. Yahoo, who soon hired Cutting, strongly supported the project with help and resources. The name Hadoop comes from Cutting's son's yellow toy elephant, which is also why the Hadoop logo looks the way it does. By 2008, Hadoop was being used by several large companies like Amazon, Facebook etc. (Turkington, 2013).
Hadoop is a Java-based framework for distributed data computation and storage. Its linear scalability and cheap infrastructure make it easy to maintain and upgrade, which makes it the main actor for Big Data processing. Hadoop is made up of two main technologies, HDFS and MapReduce, which are completely different but complement each other. By using Hadoop, large sets of data can be computed in parallel on all the nodes of the cluster. This way, all computing power and resources can be utilized efficiently (White, 2010).
2.3.2 Truecaller and Hadoop

Truecaller stores all its data in Hadoop. Hadoop processes the data with MapReduce jobs running in parallel over a cluster of distributed nodes using HDFS (Hadoop Distributed File System). MapReduce is a programming model that splits the data processing in a query: the Map stage sorts and pairs the data in key-value format, and the data is then summarized and presented in the Reduce phase (see fig. 4).
Fig 4. Simplified MapReduce dataflow (White, 2010).
The main advantages of using Hadoop instead of a relational database for storing and analysing Big Data are cost and performance. Hadoop is relatively cheap because the nodes are normal commodity computers that interact in HDFS. It is also linearly scalable, which means that when more capacity or space is needed, adding new nodes to the cluster can easily solve the issue (White, 2010). When using an RDBMS, the data has to be stored on disk and in memory, which can be problematic when it comes to large amounts of data, e.g. petabytes. Therefore, spreading the workload over multiple nodes helps solve this issue (Chen et al., 2014). However, there are some other differences where Hadoop loses against the RDBMS (see table 3). The types of queries that are mainly supposed to be executed in Hadoop are MapReduce batch jobs. Hadoop was not meant to be used for interactive on-the-fly ad-hoc queries or data discovery. RDBMSs are better for ad-hoc queries and provide a much more interactive user interface than Hadoop was intended to.
Table 3. RDBMS compared to MapReduce (White, 2010).
In recent years, many new SQL on Hadoop engines, which give the user an SQL interface and language to structure and query the data while using MapReduce, have been developed to support and provide ad-hoc querying. Hive is now more or less the industry-standard tool for SQL on Hadoop (White, 2010; Chen et al., 2014). Hive is used as the main data warehouse tool at Truecaller.
2.3.3 MapReduce

MapReduce is a programming model used for processing big data distributed and in parallel across the Hadoop cluster. MapReduce is one of the two main parts of Hadoop, and MapReduce programs can be written in many different programming languages. As noted earlier, the MapReduce model consists of two different phases:
a. Map, which takes the input data, and
b. Reduce, which outputs the final result.
In the Map stage the input data is divided into splits, with one map task for each split. The map function then runs for each record in the split, and YARN, Hadoop's resource manager, schedules the tasks and divides them among the nodes in the cluster. If a job fails, YARN sees to it that it is re-run on another machine. Having the right size of splits helps the performance of the MapReduce job: when the splits are too big, the job will be divided over fewer machines and will take longer to process, while if there are many nodes and the splits are small, the job can be divided among many machines, which can process them in parallel (Maitrey & Jha, 2015). This is something the user has to set up correctly for the environment used. Normally, a split will be 128 MB, the default size of an HDFS block. A block and a split are not the same thing. A block is a hard, pre-set division of the data. Because a record of data does not always fit in one block, or can overlap into another block, the MapReduce job uses splits to keep track of the different records to process. The number of splits for a MapReduce job can therefore not be predefined but is entirely dependent on the size of the data to be processed in the job (White, 2010).
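The relationship between input size, split size and the number of map tasks can be sketched as follows. This is a deliberate simplification under the 128 MB default mentioned above: real Hadoop split computation also considers per-file boundaries and configurable minimum/maximum split sizes, so treat this only as an illustration of the scaling, not as Hadoop's actual algorithm.

```python
import math

# Assumed default HDFS block/split size of 128 MB, as in the text.
SPLIT_SIZE_MB = 128

def num_map_tasks(input_size_mb: float) -> int:
    """Approximate number of input splits (and hence map tasks)
    for an input of the given size, one map task per split."""
    return math.ceil(input_size_mb / SPLIT_SIZE_MB)

# A 1 TB input (1,048,576 MB) yields 8192 map tasks for YARN to
# schedule across the cluster; a 200 MB file spans 2 splits because
# it does not fit inside a single 128 MB block.
print(num_map_tasks(1_048_576))  # -> 8192
print(num_map_tasks(200))        # -> 2
```

This is why the split count cannot be predefined: it falls directly out of the size of the data the job reads.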
For best results, Hadoop tries to run the map phase on the same node where the input data is stored on HDFS. This way, network bandwidth is not wasted. If that node is busy running other map tasks, Hadoop chooses another node in the cluster holding the replicated data. This process is called data locality optimization.
As mentioned, for every input split there will be one mapper, and the mapper outputs key-value pairs as its result. These key-value outputs are written to the local disk on the node and not to HDFS. This is because the result of the map task is only for temporary use; once the final result of the reduce task is completed, that data is deleted from the local disk. The reduce phase does not have the data locality optimization, because a reducer receives data from multiple mappers holding data from different nodes. After the map phase, the key-value pairs are grouped by key into a key with a list of values, before being filtered to the right output in the reduce phase (Turkington, 2013). This is the procedure for a MapReduce job with only one reducer, see Fig. 5. If dealing with a large set of data, it is advisable to use more than one reducer to avoid performance issues and bottlenecks.
The phase between map and reduce is called the shuffle phase. The shuffle phase is very important because it is where the key-value pairs are sorted and merged together, creating the key with the list of values that is used in the reduce phase before the final output (White, 2010). It becomes central when jobs use more than one reducer, see Fig. 6.
Fig 5. MapReduce flow with one reducer (White, 2010).
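The map, shuffle and reduce steps with a single reducer can be sketched as a toy word count in Python. This is an illustrative simulation of the programming model, not Hadoop code:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) key-value pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group the map output by key, producing key -> list of values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into the final output."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input records, one line of text per record:
records = ["spam calls blocked", "calls identified", "spam blocked"]
result = reduce_phase(shuffle(map_phase(records)))
# → {"spam": 2, "calls": 2, "blocked": 2, "identified": 1}
```

In real Hadoop the shuffle also sorts the keys and moves data between nodes; here it only does the grouping, which is the part the reduce phase depends on.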
When using more than one reducer, the mappers partition the data, one partition for every reducer. This way, when data is sorted and merged in the shuffle phase, the right key-value output goes to the right reducer. Without the partitioning, the shuffle phase would not know which reducer the data should be sent to, which would result in the same keys being spread out over different reducers, each outputting its own result. The output would then be processed inefficiently and would quite likely be wrong (Maitrey & Jha, 2015).
Fig 6. MapReduce data flow with multiple reducers (White, 2010).
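A minimal sketch of such a partitioner in Python, assuming the common hash-modulo scheme (Hadoop's default partitioner hashes the key modulo the number of reducers; this Python analogue is illustrative, not the Java implementation):

```python
def partition(key: str, num_reducers: int) -> int:
    """Assign a key to a reducer. Every occurrence of the same key maps to
    the same reducer index, so no key is split across reducers."""
    return hash(key) % num_reducers

# All mappers agree on the destination reducer for a given key:
assert partition("sweden", 4) == partition("sweden", 4)
```

Because the assignment depends only on the key, every mapper sends all pairs for a given key to the same partition, which is exactly the property the shuffle phase needs.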
2.3.4 HDFS
The Hadoop Distributed File System, or HDFS, is the second important part of Hadoop, together with MapReduce. HDFS is the file system used on a Hadoop cluster and is used for storing large sets of data on a cluster of distributed nodes. One of the key features of HDFS is that it runs on commodity hardware nodes: normally priced hardware that can be bought from multiple vendors, not specially designed for Hadoop. HDFS is designed to keep running if a node fails, because that is something that happens frequently and must be accounted for.
Data access follows a write-once, read-many-times pattern: the data sets used are written once and then processed over time, so reading whole data sets efficiently matters more than low latency for individual results. The high throughput of the data comes at the cost of latency. The limit on the number of files stored in HDFS depends on the memory size of the master node(s), also known as the namenode. The namenode stores the metadata of the file system in its memory (location, type, etc.) and is used to locate the files on the different worker nodes (White, 2010).
The file system works by using namenodes and datanodes (master/slave), see Fig. 7. Initially there was one namenode in HDFS, but in later versions of Hadoop, multiple namenodes have been introduced for stability and off-loading purposes. Today there are normally a primary and a secondary namenode (the latter in standby mode). The namenode manages the file system tree and metadata and keeps track of all the files and directories on HDFS, which are stored on the rest of the nodes, the datanodes. The limitation for storing files in HDFS is the memory on the namenode, which holds the reference to all the files in the system. As long as the memory on the namenode is not full, nodes and disk space can be added to the cluster, which makes it scalable. The datanodes are the workers in the HDFS system, and all the data on HDFS is stored in blocks on the datanodes. As soon as a change occurs to data on a datanode that is found in the namespace, it is recorded on the namenode (Maitrey & Jha, 2015). Storing data in blocks has the advantage that files larger than the disks on the nodes do not become a problem, since a file can be distributed among multiple blocks in HDFS. The blocks are replicated and distributed over the cluster for backup and failure concerns. The default replication factor is 3, which means that a block will be replicated and spread three times across the cluster. This makes HDFS very reliable when a node fails or goes down (Apache, 2016) (Srinivasa & Mupalla, 2015).
Fig 7. HDFS Architecture (Srinivasa & Mupalla, 2015).
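Under the defaults mentioned above (128 MB blocks, replication factor 3), the block count and raw storage cost of a file can be estimated as follows. This is a back-of-the-envelope sketch, not an HDFS API:

```python
import math

BLOCK_SIZE = 128 * 1024**2   # default HDFS block size: 128 MB
REPLICATION = 3              # default replication factor

def hdfs_blocks(file_size_bytes: int) -> int:
    """Number of HDFS blocks a file occupies (the last block may be partial,
    since blocks are a logical division and are not padded)."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

def raw_storage(file_size_bytes: int) -> int:
    """Raw cluster bytes consumed once every block is replicated 3 times."""
    return file_size_bytes * REPLICATION

one_gib = 1024**3
# A 1 GiB file spans 8 blocks and consumes 3 GiB of raw disk on the cluster.
```

This is why adding datanodes scales storage linearly, while the namenode's memory (one reference per block and file) remains the hard limit on the number of files.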
2.3.5 YARN
Apache YARN (Yet Another Resource Negotiator) is Hadoop's resource management system and was included with Hadoop 2.0. YARN is used both for resource management across the cluster and for node management, managing the resources of every single node by using containers. The
containers are used for running applications on top of YARN and have their own allocated resources such as CPU and memory. Applications can be built on YARN using different APIs, and YARN is often not something the end user has to be concerned about. Applications that use YARN as an intermediate layer between themselves and the storage layer are called YARN applications, see Fig. 8. This way, YARN makes it easier to run applications on the cluster and sees to it that resources are allocated correctly (White, 2010).
Fig 8. Applications running on YARN (White, 2010).

YARN consists of:
a. the resource manager,
b. the node manager, and
c. the application master.
The resource manager looks after the cluster's resources, while the node manager handles the nodes and containers. The application master (one per application) negotiates resources from the resource manager and, together with the node managers, helps the application run and compute tasks.
The resource manager itself consists of:
a. the scheduler, and
b. the ApplicationManager.
The scheduler is the component that schedules and allocates all the resources needed for running applications, based on what resources are needed and available in the containers. The ApplicationManager ensures that the application gets the right containers from the scheduler so that the application can run. While the application is running, it is also responsible for restarting it when failures occur, and for tracking and monitoring the status of the application, see Fig. 9 (Apache Hadoop, 2016).
Fig 9. YARN components (Apache Hadoop, 2016).
2.3.5.1 Slider
Slider is a YARN application developed to run non-YARN applications on YARN and to simplify this process. With Slider, different users, using different instances of the application, can run an application on YARN. Applications can also be stopped, killed, restarted, etc. An example of a non-YARN application that can be run with Slider is the Hadoop query engine Presto (Apache Slider Incubator, 2016) (White, 2010).
2.3.6 SQL on Hadoop
SQL (Structured Query Language) has long been used for analysing data in relational databases and is an industry-wide standard language within Business Intelligence too. To help users without software engineering skills use Hadoop to analyse and query its data, different SQL-on-top-of-Hadoop solutions have been created. More and more competing solutions emerge, each trying to give users the easiest and fastest solution on the market (Turkington, 2013). According to T. T. Maposa and M. Sethi (2015), the future of SQL on Hadoop is very bright, because the simplicity and wide user range of SQL and the power of HDFS together form a very sought-after solution (Maposa & Sethi, 2015).
2.3.6.1 Hive
Hive is an open source data warehousing solution for Hadoop with an SQL-like query language, HiveQL. Hive was developed at Facebook in 2007 and is a very broadly used tool that has become an industry-wide standard. Hive compiles HiveQL statements into MapReduce jobs and returns the results to the user in an SQL-like manner. The HiveQL language has many of the standard SQL
functions, and users can add customized functions (UDFs) for more functionality. Primitives, strings, arrays, maps and structs are among the data types supported in the tables, which is one of many reasons for Hive's popularity. Hive can be run both interactively, like a normal DBMS such as MySQL, and through batch jobs in the background or on a schedule. The standard interfaces are the command line and a web GUI called Hue.
The data model of Hive is organized into tables, each with a corresponding directory in HDFS; partitions, which are divided into subdirectories of the table directory; and buckets, the next level of partitioning within each partition, for faster processing of data.
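As an illustration, this layout can be sketched as path construction in Python. The warehouse root, table name, partition key and bucket-file naming below are hypothetical examples, not truecaller's actual layout (real Hive bucket files are named differently):

```python
def partition_path(warehouse: str, table: str, partition: dict, bucket: int) -> str:
    """Illustrative HDFS path for one bucket of a partitioned Hive table:
    table -> directory, partition -> key=value subdirectory, bucket -> file."""
    parts = "/".join(f"{key}={value}" for key, value in partition.items())
    return f"{warehouse}/{table}/{parts}/bucket_{bucket:05d}"

path = partition_path("/user/hive/warehouse", "calls", {"dt": "2016-05-01"}, 3)
# → "/user/hive/warehouse/calls/dt=2016-05-01/bucket_00003"
```

The point of the layout is partition pruning: a query restricted to one partition value only has to scan that one subdirectory instead of the whole table directory.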
The architecture of Hive consists of (see Fig. 10):
a. The client, for user interaction.
b. The HiveServer2 (Thrift) server, which interprets the different external languages against Hive.
c. The Metastore, the system catalogue that stores the metadata about all the databases, tables and partitions created in Hive.
d. The Driver, which handles the HiveQL statement during compilation and execution; every step to and from the client goes through the Driver.
e. The Compiler, which translates the HiveQL statement from the Driver into a plan that, simply speaking, consists of the MapReduce jobs submitted to the (Hadoop) execution engine. (Thusoo et al., 2009) (White, 2010) (Turkington, 2013).
Fig 10. Hive Architecture (Thusoo et al., 2009).
2.3.6.2 Presto
Presto is an SQL engine that was also developed by Facebook; its main purpose is low-latency ad-hoc querying. Presto uses ANSI SQL syntax. It can run on Hadoop but also against many other database engines. When connected to Hadoop, Presto uses the Hive Metastore to find the data on HDFS. The main difference between Presto and Hive is that Presto does not process queries using MapReduce; instead it allocates all tasks to the workers' (nodes') memory and schedules new tasks to the workers once a task is finished (Teradata, 2016). This makes Presto more suitable for interactive querying, as long as the workers do not run out of memory because of the size of the data processed in a query. In the architecture, a client sends the query to a coordinator, which parses the SQL statement and plans and schedules the execution. The execution is then pipelined and distributed to the workers located closest to the data. The execution runs multiple stages simultaneously, and data is streamed between all the stages using only memory, for low-latency output (Presto, 2016).
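The pipelined, in-memory execution model can be loosely illustrated with Python generators, where each stage consumes rows from the previous one as they arrive instead of materializing intermediate results to disk. This is a conceptual analogy only, not Presto internals, and the stages and data are invented for the example:

```python
def scan(rows):
    """Stage 1: read rows (in Presto, located via the Hive Metastore)."""
    for row in rows:
        yield row

def filter_stage(rows, predicate):
    """Stage 2: filter rows as they stream past, never storing the full set."""
    for row in rows:
        if predicate(row):
            yield row

def aggregate(rows):
    """Stage 3: consume the stream and produce the final count."""
    return sum(1 for _ in rows)

rows = [{"country": "SE"}, {"country": "IN"}, {"country": "SE"}]
pipeline = filter_stage(scan(rows), lambda r: r["country"] == "SE")
count = aggregate(pipeline)
# → 2; rows flowed through all stages one at a time, entirely in memory
```

The contrast with MapReduce is that nothing between the stages touches disk, which is also why a query whose working set exceeds worker memory fails rather than spilling.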
Fig 11. Presto Architecture (Teradata, 2016).
3. Literature review
3.1 Purpose of literature review
Many SQL-on-Hadoop benchmarks have been carried out to try to figure out which of the many projects to go for. Many of these benchmarks are done by specific vendors trying to market their own product. For that reason, they are not always trustworthy when looking to implement a solution at an organisation. The varying results often depend on the tested infrastructure, architecture and optimizations made for the specific solutions. Another factor is when in time the tests were conducted: the open source projects are in constant development to improve and become market leaders, and new versions of a solution will therefore be more optimized than previous ones. This will become clearer after reading the results from the different benchmarking attempts presented below. Most of the engines in the benchmarks below will not be included in the benchmarking at truecaller, but they show the diversity of results. I therefore think it is important to show the differences in the results, and how a benchmark alone cannot be the only source for choosing and trusting a solution. One thing the benchmarks in this chapter have in common is that all the engines beat Hive in performance, because they do not use MapReduce for data processing but mostly process the data in memory. I have chosen only Presto from the many engines mentioned in this section, because it met all the requirements that truecaller had.
3.2 Benchmarks
3.2.1 Fast-data-hackathon
The allegro.tech blog reports results from the fast-data-hackathon, where a group of users compared various SQL-on-Hadoop solutions on a 4-node test cluster in June 2015. They used 11 different queries in their tests (Allegro Tech, 2015, June 25). The tested solutions were:
- Hive on Tez
- Presto
- Impala
- Drill
- Spark
Fig 12. fast-data-hackathon results (Treasure Data Blog, 2015, March 20).
In the results above, Impala and Drill were the clear winners when it comes to execution time, but according to the group conducting the tests, they both lacked some functionality that could be found in other solutions such as Hive and Presto.
3.2.2 Renmin University
In 2014, at the Renmin University in Beijing, China, a group of researchers tested five different SQL-on-Hadoop solutions, motivated by Hive being the standard tool yet not being efficient enough for interactive querying. They used 11 different queries of varying complexity on 25-, 50- and 100-node clusters. The solutions tested were:
- Impala
- Presto
- Hive (to show the difference)
- Stinger
- Shark
Fig 13. Renmin University results for 100-node cluster (Chen et al., 2014).
The results show not only the different response times, but also that some queries could not be run on any solution except Hive. Based on the tests, the researchers concluded that none of the solutions was mature enough to be used on a daily basis (Chen et al., 2014).
3.2.3 Commercial benchmarks
3.2.3.1 Pivotal (HAWQ)
Pivotal, which developed the SQL-on-Hadoop solution HAWQ, benchmarked their product against Cloudera's Impala, Presto and Stinger on an 8-node cluster in June 2015.
Fig 14. Hawq’s speed up ratio vs. Impala (Soliman et al., 2014).
The results in the graph show how HAWQ beats Impala in the majority of the queries run in the test cases (Soliman et al., 2014) (Pivotal, 2014, June 25). It is worth highlighting that Impala had been the clear winner when it comes to speed in all the graphs above.
3.2.3.2 Cloudera (Impala)
In February 2016, Cloudera released their latest Impala benchmark, comparing Impala to Hive on Tez and Spark on a 21-node cluster.
Fig 15. Cloudera’s Impala benchmark (Cloudera, 2016, February 11).
The graph above is one of many on Cloudera's blog showing how Impala beats its competitors in all respects (Cloudera, 2016, February 11). It is worth noting that this is a normal scenario: vendors and developers trying to prove the efficiency of their own products.
3.3 Hive not meant for low-latency querying
Because Hive is based on HDFS and Hadoop, it is meant for batch jobs focusing on high throughput and scalability, and can be used both for analysis and ETL. Because of Hive's high fault tolerance and wide usability, it has become an industry-wide standard as a data warehouse tool for Hadoop (Turkington, 2013). It is therefore preferred for being reliable and stable, while other low-latency solutions are used on the side for ad-hoc querying (Cloudera, 2016).
4. Research methodology
This thesis is the start of a process to explore and test SQL-on-Hadoop query engines. The scope of this thesis is the evaluation of the query engine Presto. Presto will be evaluated against Hive, and the main focus will be on the latency of the use-case queries. Secondly, the thesis will evaluate the types of use cases Presto is suitable for, identifying and comparing the pros and cons against the baseline solution, Hive.
The methodology used in this thesis is design science, following the design-science research process as interpreted by Peffers et al. (2007), as its steps are suited for identifying, implementing and evaluating the chosen solution. The main focus of this methodology will be on the evaluation of the solution rather than building an artefact: the work centres on deploying and configuring Presto, an already developed open source query engine. Defining the right use cases will also be a big part of the design.
The steps in the design-science research process are:
1. Identify problem & motivate.
2. Define objectives for a solution.
3. Design & development.
4. Demonstration.
5. Evaluation.
6. Communication.
Fig. 16 Design Science Research Methodology (Peffers, Tuunanen, Rothenberger & Chatterjee, 2007).
4.1 Identify problems & motivate
The first step in the design-science research process is to identify and motivate the problems. At truecaller, the main problem is the latency and execution time of Hive when running ad-hoc queries.
Problems:
1. Hive/MapReduce is too slow for data discovery/ad-hoc queries.
As noted earlier, the problem with the current setup at truecaller is that the data warehouse is accessed through Hive using MapReduce on Hadoop. Running complex queries often takes unnecessarily long, which is bad both for the analytics team and for the stakeholders that request ad-hoc analysis on a regular basis. MapReduce is better for batch jobs that can be scheduled for standard analysis than for ad-hoc querying and data discovery. Hive and MapReduce take time because the processing writes its intermediate output to disk.
4.2 Define objectives for a solution
The objectives for the solution will in this case be the requirements for identifying the new solution.
Requirements:

Table 4. Requirements
1. Open source
2. SQL interface
3. Supports the storage file type ORC
4. Supports the same data types as Hive, e.g. structs, arrays, maps
5. UDF/UDA (User Defined Functions/Aggregations)
6. Standard analytical functions
7. Window functions

Before trying to identify the different solutions, discussions will be held with the end users in the analytics team to gather requirements regarding the solution. There are many open source solutions that can be tested, and with the right requirements the list can be narrowed down.

Here the use cases will be defined, based on everyday ad-hoc analysis carried out by the analytics team. The use cases will incrementally become more complex, to compare the time and resources used by the solutions with each other. An initial baseline where the use cases are run in Hive will be the second step, so that the solutions can be compared with the current situation. All the use cases will be tested in solutions supporting SQL syntax, and the queries will be the same or vary in native syntax. (E.g. HiveQL does not support all standard SQL functions, and SQL can also vary depending on the engine/dialect.)

4.3 Design & development
In this step, designing and developing an artefact is the common practice in design science. In this case, researching and identifying the solution is the pre-step before deploying the solution, first in a test environment and afterwards in the production environment (the Hadoop cluster).

Presto was the chosen solution because it met all the requirements mentioned above and was easy enough to deploy using the YARN application version.

The first step will be to test the solution in a smaller test environment before implementing it in the production environment. For Presto, the test environment will be a Docker container, which is a type of virtual environment. The container will have a one-node Hadoop cluster installed and ready to test on. The test setup consisted of installing the solution, importing a data set and test-running some random queries.
4.4 Demonstration
This step is for preparing the evaluation of the solution. In this case, defining the use cases for the evaluation is the main goal.
The implementation of the solution will not be heavily documented, even though a big part of the time in this project will be spent on deploying and configuring the solution for the current environment at truecaller. The testing in the production environment will be based on the pre-defined use cases. This step will show how the solution differs from Hive, and what the pros and cons are for the particular queries. The use cases will be run 4 times: 2 times during work hours, when there is normal load on the cluster, and 2 times during the evening. The average time in seconds will be used as the latency result.
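The measurement procedure described above could be scripted roughly as follows. Here run_query is a hypothetical placeholder for submitting a statement to Hive or Presto, not an actual client API, and the query text is an invented example:

```python
import time

def run_query(engine: str, sql: str) -> None:
    """Hypothetical placeholder: submit sql to the given engine and block
    until the result returns. A real setup would call a Hive/Presto client."""
    pass

def average_latency(engine: str, sql: str, runs: int = 4) -> float:
    """Run the query `runs` times and return the mean wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.time()
        run_query(engine, sql)
        timings.append(time.time() - start)
    return sum(timings) / len(timings)

avg = average_latency("presto", "SELECT count(*) FROM some_table")
```

Two of the four runs fall during work hours and two in the evening, so the average smooths out cluster-load effects to some degree.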
4.5 Evaluation
This step is for evaluating the results of the solution.
The results of the use cases will be recorded and visualized using bar charts to show the difference in latency between the solutions. The result chapter of the thesis will mainly focus on this step of the methodology. Other conclusions will also be drawn from the baseline and the new solution, e.g. pros and cons and suitable use cases.
4.6 Communication
The thesis itself represents this step. Together with the thesis, the conclusions and the recommendations for further testing at truecaller, a presentation will be held at LTU's campus for supervisors and other students.
5. Result
5.1 Cluster details
The Hadoop cluster at truecaller consists of 14 nodes with the following components:
OS: Linux (Debian kernel) 3.2.0.4-amd64
CPU model: Intel(R) Xeon(R) [email protected]
Cores: 32
RAM: 128 GB
HDD: ~20 TB
Table 5. Cluster details
5.2 Implementation
Presto was deployed on truecaller's Hadoop cluster using a version that allows it to run as a YARN application, using all 14 nodes. One node serves as both Coordinator and Worker, and the rest as Workers. The whole setup and configuration can be found under appendix 7.1, Presto installation.
5.3 Use cases and latency in seconds

1. Select * with restriction and limitation type 1. Table size 14 billion rows.
Hive: ~20 sec
Presto: ~5 sec
Conclusion: Presto 4x faster.
2. Select * with restriction and limitation type 2. Table size 14 billion rows.
Hive: ~40 sec
Presto: ~7 sec
Conclusion: Presto ~6x faster.
3. Select * with restriction and limitation type 3. Table size 14 billion rows.
Hive: ~200 sec
Presto: ~15 sec
Conclusion: Presto ~13x faster.
4. Select count with restriction. Table size 14 billion rows.
Hive: ~90 sec
Presto: ~23 sec
Conclusion: Presto ~4x faster.
5. Select * with restriction, limitation and ORDER BY. Table size 14 billion rows.
Hive: ~340 sec
Presto: runs out of memory because of the ORDER BY; too expensive.
Conclusion: not a suitable use case for Presto.
6. Select with 2 aggregate functions. Table size 14 billion rows.
Hive: ~600 sec
Presto: ~120 sec
Conclusion: Presto ~5x faster.
7. Select with aggregates, group by and limitation type 1. Table size 14 billion rows.
Hive: ~120 sec
Presto: ~40 sec
Conclusion: Presto ~3x faster.
8. Select with aggregates, group by and limitation type 2. Table size 14 billion rows.
Hive: ~120 sec
Presto: ~50 sec
Conclusion: Presto ~2.5x faster.
9. Select with multiple aggregates, group by, limitations and a table joined 3 times. Table size 14 billion rows.
Hive: ~6000 sec
Presto: ~7000 sec
Conclusion: Hive ~1.2x faster.
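The speedup factors quoted in the conclusions above are simply the ratio of the two average latencies. A sketch using the rounded numbers from this section:

```python
# Approximate average latencies in seconds for use cases 1-4 and 6-9
# (use case 5 is omitted because Presto ran out of memory on it).
latencies = {                 # query number: (hive_seconds, presto_seconds)
    1: (20, 5),    2: (40, 7),    3: (200, 15), 4: (90, 23),
    6: (600, 120), 7: (120, 40),  8: (120, 50), 9: (6000, 7000),
}

# Speedup > 1 means Presto is faster; < 1 means Hive is faster.
speedup = {q: hive / presto for q, (hive, presto) in latencies.items()}
# e.g. speedup[1] == 4.0 (Presto 4x faster), speedup[9] < 1 (Hive faster)
```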
5.4 Result table
Fig 17. Query execution time in seconds
6. Discussion
6.1 Presto at truecaller
Looking at the queries where Presto performed faster than Hive, the average gain in latency is around 5x. This is a bit lower than some example benchmarks that can be found online. However, it is still a big gain when queries need to be run over and over again during work. Presto showed clearly that it is better suited for returning fast insights on a smaller scale, and for ad-hoc analysis by someone who is familiar with the data sets and knows what and where to look for. It is less suited for more expensive queries, where a worker needs to hold large amounts of metadata and risks running out of memory. In the configuration and testing, allocating 70 GB of maximum query memory was not enough for query 5, which contained an ORDER BY operation and ran out of memory. Presto is also not suited for multiple joins between large tables, where it performed slower than Hive.

In truecaller's case, Presto is fine for many of the common day-to-day ad-hoc use cases. At truecaller, multiple tools are used within the analytics team, and knowing the pros and cons of the tools helps the users choose which one to go for. For some ETL jobs Spark is already being used, so the team will not suffer from adding another tool to its toolbox. Presto has some minor syntax differences, but they are easy to learn and should not be seen as an obstacle.

Another finding from this research is that Presto can be configured and tuned on a very detailed level. The results presented in the graph came after numerous iterations of configuration and tuning. There is also the standard version, which requires more configuration at the operating system level; the version tested at truecaller was the YARN application version. There is therefore more room for tuning and testing Presto before choosing the final settings for truecaller's Hadoop cluster and environment. My recommendation, after tuning and configuring Presto, would be to use Presto for the more lightweight queries together with Hive for the bigger jobs (ETL, batch, etc.).
6.2 Use cases
The use cases chosen for the evaluation are based on work I did for truecaller alongside the thesis work. The queries used are actual ad-hoc requests to the analytics team, which I took on to learn more about the data that is analysed on a daily basis. That way I got a good picture of which types of queries are often run, and will be run in a new solution if implemented.
6.3 Limitations
The goal from the beginning was to evaluate 2-3 solutions and compare them to Hive, as I did with Presto. Due to resource limitations and the complexity of the project, I could not proceed with testing more solutions. This mostly had to do with me not risking compromising the daily work by trying to implement new solutions on the production cluster without senior supervision.
6.4 Commercial alternatives
The reason to test open source solutions was, as mentioned earlier, to keep costs down. There are otherwise commercial alternatives that perform very well and run stably with Hadoop. An example is Exasol, an in-memory analytics database. I have tested and worked with Exasol in other projects and would recommend it as a low-latency solution for companies that can consider paying for such a solution.
6.5 Further research
Tuning and configuring Presto with professional help would be a first step to see how much faster Presto could become. It would also be interesting to see whether there are any gains for heavier queries and jobs compared to the current configuration at truecaller. A next step would be to see whether gains in execution time would be possible by tuning the parameters of the ORC file format. This was another intention of this research, but it could not be tested due to time limitations.
Apache Drill and Apache Tajo are the other two solutions that were meant to be tested and evaluated in this thesis work. Tajo was actually implemented, but did not support some of the required file types and data types. I have been in close contact with the developers of Tajo, and it should meet all of truecaller's requirements in the next release, which was initially planned for sometime in April 2016 but was delayed. After the release, and if there is time and resources, looking into both of these solutions could be of interest for truecaller.
If there are additional resources and time, I would recommend testing Impala by changing file formats from ORC to Parquet and running the same types of tests. Based on the results of most of the benchmarks read during this research, Impala has proven to be a strong candidate for a lightweight SQL on Hadoop tool. The only problem is the difference in the file types (ORC vs. Parquet) that truecaller has chosen for storage.
7. Conclusion
There are many SQL on Hadoop solutions on the market today. Presto is one of the best-known open source SQL on Hadoop solutions and was chosen as the tool to be tested after meeting all the requirements. Since only Presto was tested, due to the limitations described above, it cannot be said that it is the only or the perfect solution for truecaller's case. However, the thesis has answered its research questions: Presto proved to be on average 5 times faster than Hive. Presto worked best for lighter ad-hoc analysis on smaller data sets, or on sample data taken from larger sets to prove a point. It did not work well with larger data sets and heavier queries containing multiple joins or self joins. Presto can be tuned and optimized to fit the particular environment and perform at its best.

The use cases and all testing were done at truecaller in Stockholm, and the results all depend on the configuration and hardware of truecaller's Hadoop cluster. Presto is now in use by the analytics team at truecaller for specific ad-hoc use cases and will not permanently replace Hive or any other tool at the company. Users who know the data will be able to make the right decision about when to use Presto and when not to. Using Presto for the right use cases will save truecaller's analytics team time and will add efficiency to their work, which was the main purpose for testing another solution. By deploying Presto at truecaller, this goal has been reached for now.
8. References
8.1 Books and articles
• White, T. (2010). Hadoop: The Definitive Guide (2nd ed.). Farnham: O'Reilly.
• Chen, Y., Qin, X., Bian, H., Chen, J., Dong, Z., Du, X., & Zhang, H. (2014). A Study of SQL-on-Hadoop Systems. Lecture Notes in Computer Science, (8807), 154-166.
• Walker, R. (2015). From Big Data to Big Profits: Success with Data and Analytics. Oxford University Press.
• Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35, 137-144. doi:10.1016/j.ijinfomgt.2014.10.007
• National Institute of Standards and Technology. (2015). Big Data Interoperability Framework: Volume 1, Definitions. U.S. Department of Commerce.
• Turkington, G. (2013). Hadoop Beginner's Guide. Birmingham: Packt Publishing.
• Cervone, F. (2015). Big Data and Analytics. Online Searcher, 39(6), 22-27.
• Maitrey, S., & Jha, C. (2015). MapReduce: Simplified Data Analysis of Big Data. Procedia Computer Science, 57 (3rd International Conference on Recent Trends in Computing 2015 (ICRTC-2015)), 563-571. doi:10.1016/j.procs.2015.07.392
• Jukić, N., Sharma, A., Nestorov, S., & Jukić, B. (2015). Augmenting Data Warehouses with Big Data. Information Systems Management, 32(3), 200-209. doi:10.1080/10580530.2015.1044338
• Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., ... & Murthy, R. (2009). Hive – a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2), 1626-1629.
• Connolly, T. M., & Begg, C. E. (2005). Database Systems: A Practical Approach to Design, Implementation and Management (4th rev. ed.). Harlow: Addison-Wesley.
• Maposa, T. T., & Sethi, M. (2015). SQL-on-Hadoop: The Most Probable Future in Big Data Analytics. Advances in Computer Science and Information Technology (ACSIT), 2(13), 9-14.
• Srinivasa, K. G., & Mupalla, A. K. (2015). Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark. Springer.
• Soliman, M., Petropoulos, M., Waas, F., Narayanan, S., Krikellas, K., Baldwin, R., et al. (2014). Orca. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data – SIGMOD '14, 337-348. doi:10.1145/2588555.2595637
• Choudhury, K. (2015, October 12). Truecaller's India user base grows to 100 mn. Business Standard, New Delhi. Retrieved from http://www.business-standard.com/article/companies/truecaller-s-india-user-base-grows-to-100-mn-115101200037_1.html
• Rukmini, S. (2014, December 13). India losing war on spam. The Hindu. Retrieved from http://www.thehindu.com/sci-tech/technology/10-lakh-complaints-about-spam-calls-in-the-last-three-years/article6689376.ece
• Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A Design Science Research Methodology for Information Systems Research. Journal of Management Information Systems, 24(3), 45-77.
8.2 Websites and blog posts
• Truecaller. (2016). Making Your Phone Smarter. Retrieved from https://www.truecaller.com/about
• Apache. (2016). Hadoop Wiki, Project description. Retrieved from http://wiki.apache.org/hadoop/ProjectDescription
• Apache Hadoop. (2016). Apache Hadoop YARN. Retrieved from http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
• Apache Slider Incubator. (2016). Apache Slider Architecture. Retrieved from https://slider.incubator.apache.org/design/architecture.html
• Teradata. (2016). Presto, a Faster, More Scalable, More Flexible Open Source SQL on Hadoop Engine. Retrieved from http://assets.teradata.com/resourceCenter/downloads/Datasheets/EB8901.pdf?processed=1
• Presto. (2016). Overview. Retrieved from https://prestodb.io/overview.html
• Cloudera. (2016). What is Hive? Retrieved from https://archive.cloudera.com/cdh4/cdh/4/hive/
• Allegro Tech. (2015, June 10). Fast Data Hackathon [Blog post]. Retrieved from http://allegro.tech/2015/06/fast-data-hackathon.html
• Pivotal. (2014, June 25). Pivotal HAWQ Benchmark Demonstrates Up To 21x Faster Performance on Hadoop Queries Than SQL-like Solutions [Blog post]. Retrieved from https://blog.pivotal.io/big-data-pivotal/products/pivotal-hawq-benchmark-demonstrates-up-to-21x-faster-performance-on-hadoop-queries-than-sql-like-solutions
• Cloudera. (2016, February 11). New SQL Benchmarks: Apache Impala (incubating) Uniquely Delivers Analytic Database Performance [Blog post]. Retrieved from https://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
• Fishman, J. (2014, July 14). Apple Solar Frenzy: Benefiting Shareholders and the Planet Alike [Blog post]. Retrieved from http://solarstockideas.com/apple-solar-frenzy-benefiting-shareholders-planet-alike/
9. Appendices
9.1 Appendix A – Presto installation
Guide to installing Presto on a Hadoop cluster.
On a Docker container with a Cloudera one-node cluster for testing. Prerequisites:

§ HDFS
§ YARN
§ ZooKeeper
§ Java
§ Maven
Step 1. Install Slider to run the Presto application on YARN:

1. Create a yarn user on the container's HDFS with:

hdfs dfs -mkdir -p /user/yarn
hdfs dfs -chown yarn:yarn /user/yarn

2. Download Slider 0.80.0 for YARN (http://apache.mirrors.spacedump.net/incubator/slider/0.80.0-incubating/slider-assembly-0.80.0-incubating-all.tar.gz)

3. Untar the file with the command: tar -xvf slider-0.80.0-incubating-all.tar.gz

4. Open the new directory, go to the slider-assembly/target directory and copy 0.80.0-incubating-all.zip to a new folder where you want to install Slider.

5. Unzip the file in that folder; the slider folder will now be created.

6. Go to the slider folder/conf, open the file slider-env.sh and set the environment variables:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_CONF_DIR=/etc/hadoop/conf

7. Open the file slider-client.xml and set the values:

<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>localhost:2181</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>localhost:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>localhost:8030</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
</property>

8. Test Slider with the command:

${slider-install-dir}/slider-0.80.0-incubating/bin/slider version
If you get the output "Compiled against Hadoop 2.6.0", Slider is installed correctly.
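The slider-client.xml edits above can also be scripted instead of done by hand. A minimal sketch using Python's standard xml.etree module, assuming the same four property names and single-node localhost values as in the guide:

```python
import xml.etree.ElementTree as ET

# The four slider-client.xml properties set in the step above, with the
# single-node localhost values from this guide (adjust per cluster).
props = {
    "hadoop.registry.zk.quorum": "localhost:2181",
    "yarn.resourcemanager.address": "localhost:8032",
    "yarn.resourcemanager.scheduler.address": "localhost:8030",
    "fs.defaultFS": "hdfs://localhost:8020",
}

# Build the <configuration><property><name/><value/></property>... tree.
root = ET.Element("configuration")
for name, value in props.items():
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

In practice the generated fragment would be merged into the existing slider-client.xml rather than written as a whole file; this sketch only shows the property layout.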
Step 2. Install Presto:

1. Download the presto-yarn package from GitHub: (git clone https://github.com/prestodb/presto-yarn.git)

2. Build the package with Maven: mvn clean package

3. Go to the created folder, then to presto-yarn-package/target, and copy presto-yarn-package-1.0.0-SNAPSHOT-0.130.zip into the slider folder.

4. Go back to the presto-yarn-package folder, then to /src/main/resources, and copy the files appConfig.json and resources-singlenode.json into the slider folder.

5. In the slider folder, rename resources-singlenode.json to resources.json, open it and edit the settings:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": { "yarn.vcores": "1" },
  "components": {
    "slider-appmaster": {},
    "COORDINATOR": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.memory": "1500"
    }
  }
}

6. Open appConfig.json and edit the settings:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {
    "site.global.app_user": "yarn",
    "site.global.user_group": "hadoop",
    "site.global.data_dir": "/var/lib/presto/data",
    "site.global.config_dir": "/var/lib/presto/etc",
    "site.global.app_name": "presto-server-0.130",
    "site.global.app_pkg_plugin": "${AGENT_WORK_ROOT}/app/definition/package/plugins/",
    "site.global.singlenode": "true",
    "site.global.coordinator_host": "${COORDINATOR_HOST}",
    "site.global.presto_query_max_memory": "50GB",
    "site.global.presto_query_max_memory_per_node": "512MB",
    "site.global.presto_server_port": "8080",
"site.global.catalog":"{'hive':['connector.name=hive-cdh5','hive.metastore.uri=thrift://${NN_HOST}:9083'],'tpch':['connector.name=tpch']}","site.global.jvm_args":"['-server','-Xmx1024M','-XX:+UseG1GC','-XX:G1HeapRegionSize=32M','-XX:+UseGCOverheadLimit','-XX:+ExplicitGCInvokesConcurrent','-XX:+HeapDumpOnOutOfMemoryError','-XX:OnOutOfMemoryError=kill-9%p']","application.def":".slider/package/PRESTO/presto-yarn-package-1.1-SNAPSHOT-0.130.zip","java_home":"/usr/lib/jvm/java-8-oracle/"},"components":{"slider-appmaster":{"jvm.heapsize":"128M"}}}Testdifferentsettingsforenvironmentnodememory1024MBwastoomuchforusandwechangedto:site.global.presto_query_max_memory_per_node":"512MB"
7. In the slider folder, change user to yarn (su yarn) and run the command to install Presto:

bin/slider package --install --name PRESTO --package presto-yarn-package-*.zip

8. Run Presto with the command:

bin/slider create presto1 --template appConfig.json --resources resources.json

If it exited with status 0, the Presto app should now be running.

9. Go to the YARN ResourceManager web UI to check if Presto is running (http://science2.truecaller.net:8088): the application presto1 should appear in the first row as running.

10. Create a new folder and download the Presto client jar file (https://repository.sonatype.org/service/local/artifact/maven/content?r=central-proxy&g=com.facebook.presto&a=presto-cli&v=RELEASE)

11. Rename the jar file to just presto (mv *.jar presto). Run the command to make it executable:

chmod +x presto

12. To test Presto, go to the Hive client and create a table with data. Now go back to the folder with presto as the yarn user (su yarn) and type the command to run Presto:

./presto --server localhost:8080 --catalog system --schema default (or whichever schema you want to use)

Type show tables; if the table is showing, try out a query.
On the hdp3 cluster:
Step 1. Install Slider to run the Presto application on YARN:

1. Create a yarn user on the cluster's HDFS with:

hdfs dfs -mkdir -p /user/yarn
hdfs dfs -chown yarn:yarn /user/yarn

2. Download Slider 0.80.0 for YARN (http://apache.mirrors.spacedump.net/incubator/slider/0.80.0-incubating/slider-assembly-0.80.0-incubating-all.tar.gz)

3. Untar the file with the command: tar -xvf slider-0.80.0-incubating-all.tar.gz

4. Open the new directory, go to the slider-assembly/target directory and copy 0.80.0-incubating-all.zip to a new folder where you want to install Slider.

5. Unzip the file in that folder; the slider folder will now be created.

6. Go to the slider folder/conf, open the file slider-env.sh and set the environment variables:

export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-x64/
export HADOOP_CONF_DIR=/etc/hadoop/conf

7. Open the file slider-client.xml and set the values:

<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>----------------</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>-----</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value> </value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs---8020</value>
</property>

8. Test Slider with the command:

${slider-install-dir}/slider-0.80.0-incubating/bin/slider version
If you get the output "Compiled against Hadoop 2.6.0", Slider is installed correctly.
Step 2. Install Presto:

1. Download the presto-yarn package from GitHub: (git clone https://github.com/prestodb/presto-yarn.git)

2. Build the package with Maven: mvn clean package

3. Go to the created folder, then to presto-yarn-package/target, and copy presto-yarn-package-1.0.0-SNAPSHOT-0.130.zip into the slider folder.

4. Go back to the presto-yarn-package folder, then to /src/main/resources, and copy the files appConfig.json and resources-singlenode.json into the slider folder.

5. In the slider folder, rename resources-singlenode.json to resources.json, open it and edit the settings:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
"global":{"yarn.vcores":"8"},"components":{"slider-appmaster":{},"COORDINATOR":{"yarn.role.priority":"1","yarn.component.instances":"1","yarn.component.placement.policy":"1","yarn.memory":"4000"},"WORKER":{"yarn.role.priority":"2","yarn.component.instances":"13","yarn.component.placement.policy":"1","yarn.memory":"8000"}}
6. Open appConfig.json and edit the settings:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {
    "site.global.app_user": "yarn",
    "site.global.user_group": "hadoop",
    "site.global.data_dir": "/var/lib/presto/data",
    "site.global.config_dir": "/var/lib/presto/etc",
    "site.global.app_name": "presto-server-0.142",
    "site.global.app_pkg_plugin": "${AGENT_WORK_ROOT}/app/definition/package/plugins/",
    "site.global.singlenode": "true",
    "site.global.coordinator_host": "${COORDINATOR_HOST}",
    "site.global.presto_query_max_memory": "50GB",
    "site.global.presto_query_max_memory_per_node": "4GB",
    "site.global.presto_server_port": "8080",
    "site.global.catalog": "{'hive': ['connector.name=hive-cdh5', 'hive.metastore.uri=thrift://hdp3.truecaller.net:9083'], 'tpch': ['connector.name=tpch']}",
    "site.global.jvm_args": "['-server', '-Xmx8192M', '-XX:+UseG1GC', '-XX:G1HeapRegionSize=32M', '-XX:+UseGCOverheadLimit', '-XX:+ExplicitGCInvokesConcurrent', '-XX:+HeapDumpOnOutOfMemoryError', '-XX:OnOutOfMemoryError=kill -9 %p']",
    "application.def": ".slider/package/PRESTO/presto-yarn-package-1.1-SNAPSHOT-0.142.zip",
    "java_home": "/usr/lib/jvm/jdk-8-oracle-x64/"
  },
  "components": {
    "slider-appmaster": {
      "jvm.heapsize": "128M"
    }
  }
}
7. In the slider folder, change user to yarn (su yarn) and run the command to install Presto:

bin/slider package --install --name PRESTO --package presto-yarn-package-*.zip

8. Run Presto with the command:

bin/slider create presto1 --template appConfig.json --resources resources.json

If it exited with status 0, the Presto app should now be running.

9. Go to the YARN ResourceManager web UI to check if Presto is running (the application presto1 should appear in the first row as running).

10. Create a new folder and download the Presto client jar file (https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.142/presto-cli-0.142-executable.jar)

11. Rename the jar file to just presto (mv *.jar presto). Run the command to make it executable:

chmod +x presto

12. To test Presto, go to the Hive client and create a table with data. Go to the ResourceManager web interface, find presto1 under running applications and find the URL to the coordinator node.

13. Now go back to the folder with presto as the yarn user (su yarn) and type the command to run Presto:

./presto --server [URL to coordinator]:8080 --catalog hive --schema default (or whichever schema you want to use)

Type show tables; if the table is showing, try out a query.
9.2 Appendix B – Use case queries
1. SELECT * with restriction AND LIMIT 1:

SELECT *
FROM app_events
WHERE UPPER(user.country_code) = 'IN'
AND dt = 20160221
AND h = 02
LIMIT 10000;
2. SELECT * with restriction AND LIMIT 2:

SELECT *
FROM app_events
WHERE dt = 20160221
AND appsearch.type IN ('1', '2')
LIMIT 10000;
3. SELECT * with restriction AND LIMIT 3:

SELECT * FROM app_events
WHERE user.register_id = 2107422
AND appsearch.type = '2'
AND dt BETWEEN 20160123 AND 20160125
LIMIT 50;
4. SELECT COUNT(*) with restriction:

SELECT COUNT(*) FROM app_events
WHERE UPPER(user.country_code) = 'IN'
AND appsearch.type = '2'
AND dt = 20160123;
5. SELECT * with restriction, LIMIT AND ORDER BY:

SELECT *
FROM app_events
WHERE dt = 20160221
AND user.country_code IS NOT NULL
AND appsearch.type IN ('1', '2')
ORDER BY UPPER(user.country_code)
LIMIT 10000;
6. SELECT * with restriction, LIMIT AND SORT BY:

SELECT *
FROM app_events
WHERE dt = 20160221
AND h = 02
AND user.country_code IS NOT NULL
AND appsearch.type IN ('1', '2')
SORT BY UPPER(user.country_code);
7. Simple SELECT with aggregates:

SELECT
SUM(appsearchv2.attempts[0].latency) sum_latency,
AVG(appsearchv2.attempts[0].latency) avg_latency
FROM data.app_events
WHERE size(appsearchv2.attempts) > 0
AND dt BETWEEN 20160130 AND 20160202
AND UPPER(user.country_code) = 'IN'
AND appsearchv2.attempts[0].endpoint NOT LIKE '%localhost%';
8. SELECT with aggregates AND GROUP BY:

SELECT user.register_id,
appsearch.searchrequest.normalizedphonenumber AS incoming_nr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls,
SUM(if(appsearch.type = 5, 1, 0)) sms
FROM data.app_events
WHERE dt = 20160121
AND appsearch.searchrequest.searchcountrycode = 'SE'
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY user.register_id,
appsearch.searchrequest.normalizedphonenumber
LIMIT 10000;
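The SUM(if(...)) pattern used in queries 8-12 is conditional counting: one pass over the rows produces a separate count per event type. A small Python sketch of the same idea over made-up rows (the type codes 2 = incoming call and 5 = SMS follow the queries above):

```python
# Sketch: the SUM(if(type = t, 1, 0)) pattern as a single pass in Python.
# Rows are made-up (register_id, appsearch.type) pairs for illustration.
rows = [(1, 2), (1, 2), (1, 5), (2, 2), (2, 7)]

counts = {}  # register_id -> (incoming_calls, sms)
for register_id, event_type in rows:
    calls, sms = counts.get(register_id, (0, 0))
    counts[register_id] = (calls + (event_type == 2),
                           sms + (event_type == 5))

print(counts)
```

This is exactly what the GROUP BY plus the two SUM(if(...)) expressions compute, which is why a single scan of app_events yields both counters at once instead of two separate queries.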
9. SELECT with aggregates with GROUP BY AND ORDER BY:

SELECT user.register_id,
appsearch.searchrequest.normalizedphonenumber AS incoming_nr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls,
SUM(if(appsearch.type = 5, 1, 0)) sms
FROM data.app_events
WHERE dt = 20160121
AND appsearch.searchrequest.searchcountrycode = 'SE'
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY user.register_id,
appsearch.searchrequest.normalizedphonenumber
ORDER BY user.register_id,
incoming_nr
LIMIT 10000;
10. SELECT FROM inner query with inner GROUP BY AND outer ORDER BY:
SELECT register_id,
incoming_nr,
incoming_calls,
sms
FROM (
SELECT user.register_id,
appsearch.searchrequest.normalizedphonenumber AS incoming_nr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls,
SUM(if(appsearch.type = 5, 1, 0)) sms
FROM data.app_events
WHERE dt = 20160125
AND appsearch.searchrequest.searchcountrycode = 'SE'
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY user.register_id,
appsearch.searchrequest.normalizedphonenumber) x
WHERE incoming_calls > 0 OR sms > 0
ORDER BY register_id
LIMIT 10000;
11. JOIN with 2 inner queries with aggregates AND ORDER BY:

SELECT
a.city,
COUNT(DISTINCT a.registerid) nr_IOS_profiles
FROM (
SELECT
LOWER(city) city,
registerid
FROM data.user_profile_data
WHERE LOWER(city) <> ''
AND dt = 20160221
AND registerid <> 0) a
INNER JOIN (
SELECT
id
FROM userprimaryserviceprovider
WHERE LOWER(os) LIKE 'iphone%'
AND dt = 20160221) b
ON b.id = a.registerid
GROUP BY a.city
ORDER BY nr_IOS_profiles DESC
LIMIT 10;
12. 3 joins with aggregates AND SORT BY:

SELECT
FROM_UNIXTIME(UNIX_TIMESTAMP() - 86400, 'yyyyMMdd') AS dt,
a.countrycode,
a.incoming_phonenr,
COALESCE(c.incoming_calls_24h, 0L) incoming_calls_24h,
COALESCE(c.sms_24h, 0L) sms_24h,
COALESCE(b.incoming_calls_3d, 0L) incoming_calls_3d,
COALESCE(b.sms_3d, 0L) sms_3d,
a.incoming_calls_7d,
a.sms_7d
FROM
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_7d,
sms_7d
FROM (
SELECT
appsearch.searchrequest.searchcountrycode AS countrycode,
appsearch.searchrequest.normalizedphonenumber AS incoming_phonenr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls_7d,
SUM(if(appsearch.type = 5, 1, 0)) sms_7d
FROM data.app_events
WHERE dt >= 20160101 AND dt <= 20160107
AND UPPER(appsearch.searchrequest.searchcountrycode) IN ('SE', 'US')
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber) x
WHERE incoming_calls_7d > 0 OR sms_7d > 0
SORT BY countrycode) a
LEFT JOIN
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_3d,
sms_3d
FROM (
SELECT
appsearch.searchrequest.searchcountrycode AS countrycode,
appsearch.searchrequest.normalizedphonenumber AS incoming_phonenr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls_3d,
SUM(if(appsearch.type = 5, 1, 0)) sms_3d
FROM data.app_events
WHERE dt >= 20160101 AND dt <= 20160103
AND UPPER(appsearch.searchrequest.searchcountrycode) IN ('SE', 'US')
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber) x
WHERE incoming_calls_3d > 0 OR sms_3d > 0
SORT BY countrycode) b
ON a.countrycode = b.countrycode
AND a.incoming_phonenr = b.incoming_phonenr
LEFT JOIN
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_24h,
sms_24h
FROM (
SELECT
appsearch.searchrequest.searchcountrycode AS countrycode,
appsearch.searchrequest.normalizedphonenumber AS incoming_phonenr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls_24h,
SUM(if(appsearch.type = 5, 1, 0)) sms_24h
FROM data.app_events
WHERE dt = 20160101
AND UPPER(appsearch.searchrequest.searchcountrycode) IN ('SE', 'US')
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber) x
WHERE incoming_calls_24h > 0 OR sms_24h > 0
SORT BY countrycode) c
ON b.countrycode = c.countrycode
AND b.incoming_phonenr = c.incoming_phonenr
SORT BY countrycode
;
13. Create table out of 3 joins FROM a large table with multiple aggregates AND GROUP BY:

CREATE TABLE sahir.nr_search_events_30days_v2 AS
SELECT
a.dt,
a.country_code,
COALESCE(b.hits, 0L) AS total_result_hits,
COALESCE(a.hits, 0L) AS 1_result_hits,
COALESCE(c.hits, 0L) AS 0_result_hits
FROM
(SELECT
dt,
LOWER(user.country_code) country_code,
COUNT(*) hits
FROM data.app_events
WHERE dt BETWEEN 20160201 AND 20160301
AND SIZE(search.sources) > 0
AND LOWER(search.sources[0]) LIKE '%ugc%'
AND search.result_size = 1
GROUP BY
dt,
LOWER(user.country_code)) a
LEFT JOIN
(SELECT
dt,
LOWER(user.country_code) country_code,
COUNT(*) hits
FROM data.app_events
WHERE dt BETWEEN 20160201 AND 20160301
AND SIZE(search.sources) > 0
AND search.result_size > 0
GROUP BY
dt,
LOWER(user.country_code)) b
ON a.dt = b.dt
AND LOWER(a.country_code) = LOWER(b.country_code)
LEFT JOIN
(SELECT
dt,
LOWER(user.country_code) country_code,
COUNT(*) hits
FROM data.app_events
WHERE dt BETWEEN 20160201 AND 20160301
AND SIZE(search.sources) > 0
AND search.result_size = 0
GROUP BY
dt,
LOWER(user.country_code)) c
ON b.dt = c.dt
AND LOWER(b.country_code) = LOWER(c.country_code)