
BACHELOR THESIS

Evaluating Presto as an SQL on Hadoop Solution

A Case at Truecaller

Sahir Ahmed
2016

Bachelor of Arts
Systems Science

Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering


Abstract
Truecaller is a mobile application with over 200 million unique users worldwide. Every day Truecaller stores over 1 billion rows of data that are used for analysis to improve the product. The data is stored in Hadoop, a framework for storing and analysing large amounts of data on a distributed file system. In order to analyse these large amounts of data, the analytics team needs a new solution for more lightweight, ad-hoc analysis. This thesis evaluates the performance of the query engine Presto to see whether it meets the requirements to help the data analytics team at Truecaller gain efficiency. Using a design-science methodology, Presto's pros and cons are presented. Presto is recommended as a solution to be used together with today's tools for specific lightweight use cases, by users who are familiar with the datasets used by the analytics team. Other solutions are also recommended for future evaluation before a final decision is taken.

Keywords: Hadoop, Big Data, Presto, Hive, SQL on Hadoop


Table of contents

Abstract
Table of contents
Table of figures
Table of tables
Glossary
1. Introduction
   1.1 Problem statement
      1.1.1 Problem description
      1.1.2 Research questions
      1.1.3 Purpose
   1.2 Motivations for research
   1.3 Delimitations
2. Background
   2.1 Truecaller
   2.2 Big Data
      2.2.1 What is Big Data?
      2.2.2 Data Analytics
      2.2.3 Data Warehouse
   2.3 Hadoop
      2.3.1 What is Hadoop
      2.3.2 Truecaller and Hadoop
      2.3.3 MapReduce
      2.3.4 HDFS
      2.3.5 YARN
         2.3.5.1 Slider
      2.3.6 SQL on Hadoop
         2.3.6.1 Hive
         2.3.6.2 Presto
3. Literature review
   3.1 Purpose of literature review
   3.2 Benchmarks
      3.2.1 Fast-data-hackathon
      3.2.2 Renmin University
      3.2.3 Commercial benchmarks
         3.2.3.1 Pivotal (HAWQ)
         3.2.3.2 Cloudera (Impala)
   3.3 Hive not meant for low-latency querying
4. Research methodology
   4.1 Identify problems & Motivate
   4.2 Define Objectives for solution
   4.3 Design & Development
   4.4 Demonstration
   4.5 Evaluation
   4.6 Communication
5. Result
   5.1 Cluster details
   5.2 Implementation
   5.3 Use cases and latency in seconds
   5.4 Result table
6. Discussion
   6.1 Presto at Truecaller
   6.2 Use cases
   6.3 Limitations
   6.4 Commercial alternatives
   6.5 Further research
7. Conclusion
8. References
   8.1 Books and articles
   8.2 Websites and blog posts
9. Appendices
   9.1 Appendix A – Presto installation
   9.2 Appendix B – Use case queries


Table of figures

Fig 1. Exponential growth of data (Fishman, 2014, July 14)
Fig 2. The Big data process and sub processes (Gandomi & Haider, 2015)
Fig 3. Star schema (Connolly & Begg, 2005)
Fig 4. Simplified MapReduce dataflow (White, 2010)
Fig 5. MapReduce flow with one reducer (White, 2010)
Fig 6. MapReduce data flow with multiple reducers (White, 2010)
Fig 7. HDFS Architecture (Srinivasa & Mupalla, 2015)
Fig 8. Applications running on YARN (White, 2010)
Fig 9. YARN components (Apache Hadoop, 2016)
Fig 10. Hive Architecture (Thusoo et al., 2009)
Fig 11. Presto Architecture (Teradata, 2016)
Fig 12. fast-data-hackathon results (Treasure Data Blog, 2015, March 20)
Fig 13. Renmin University results for 100-node cluster (Chen et al., 2014)
Fig 14. HAWQ's speed-up ratio vs. Impala (Soliman et al., 2014)
Fig 15. Cloudera's Impala benchmark (Cloudera, 2016, February 11)
Fig 16. Design Science Research Methodology (Peffers, Tuunanen, Rothenberger & Chatterjee, 2007)
Fig 17. Query execution time in seconds


Table of tables

Table 1. Glossary
Table 2. What is big data? (Cervone, 2015)
Table 3. RDBMS compared to MapReduce (White, 2010)
Table 4. Requirements
Table 5. Cluster details


Glossary

Hadoop: open source Java framework for distributed data storage
HDFS: Hadoop Distributed File System
YARN: (Yet Another Resource Negotiator) resource manager for Hadoop
RDBMS: relational database management system
Node: commodity computer (hardware) used by HDFS
Cluster: a group of connected nodes used by HDFS
Hive: SQL data warehouse for Hadoop
MapReduce: programming model for processing data on Hadoop
Presto: SQL on top of Hadoop solution
Slider: application to run non-YARN applications on YARN

Table 1. Glossary


1. Introduction
This section introduces the reader to Truecaller, the company where this study is conducted, and gives some background about the current situation regarding the data that is stored and analysed. Following that, the data warehouse and the tools used for data analytics within the analytics team at Truecaller are described. Hadoop is one of the main tools used and is described at a fairly high level.

Truecaller is a mobile app used by over 200 million users worldwide. At Truecaller today, the daily incoming data (about half a billion rows) is stored in a distributed file system using the open source Java framework Hadoop. The collected data is used by all the departments of the company for improving the product and making new decisions. In order to analyse the large amounts of data, the data analytics team is looking for a new low-latency solution that can be used as a complement to their standard Hadoop tools, which in some cases are considered too slow. The goal of this research is to identify, test and evaluate a number of tools that fit the team's requirements and could potentially be implemented.

1.1 Problem statement
"Running ad-hoc queries on Hive is too slow for data discovery" – Björn Brinne, Head of Data Analytics at Truecaller

1.1.1 Problem description
The problem within the data analytics organisation at Truecaller is that Hive is not fast enough for ad-hoc querying and on-the-fly data discovery. The users run their queries in Hive, and longer queries with multiple joins and aggregations can take up to several hours to execute. After the query has run, the user might not be satisfied with the outcome and will have to adjust and rerun the query, sometimes in multiple iterations. This makes the whole process ineffective and, because of the poor performance of the tools, holds the data scientists and other users back from using their creativity to add more value to their job and company. It also limits them to just following what is prioritised at that time, which leads to some work not being done or being pushed far back in priority. This can also make the analytics organisation less innovative.

1.1.2 Research questions
1. How much faster than Hive is Presto on Truecaller's Hadoop cluster?
2. Which type of use cases is Presto suitable/not suitable for?

1.1.3 Purpose
The purpose of this project is to identify, test and implement a solution to support data discovery and ad-hoc querying of data in the Truecaller data warehouse. The solution is meant to complement Hadoop and will hopefully be implemented and used by the analytics team at Truecaller. There are several commercial and open source products that are supposed to provide this functionality to the users. These products often vary in performance and depend on the environment they are set up in. Therefore, the aim and purpose of this research is only to solve the issues and use cases at Truecaller; it is not intended as a general recommendation or solution for other companies to rely upon. However, the outcomes can provide the reader with an overview that can be compared to other articles on the subject. The main goal is to advise Truecaller on which of the identified solutions to implement and use as an analytical tool.

1.2 Motivations for research
The main motivation is to test and evaluate newer techniques within the growing Big Data field and to compare them with the industry-standard tool Hive. A secondary motivation is to provide support to the data analytics team at Truecaller, which is undergoing rapid growth and wishes to find a more suitable tool for faster querying and analysis of its daily incoming data.

1.3 Delimitations
The project will run half-time over 4 months and will be based on pre-specified use cases for Truecaller. One solution is to be tested: first in a test environment, e.g. a virtual one-node cluster, and as a second stage in the production environment. The solution must be free and open source to keep the costs down. This project will mostly focus on how fast the data is presented and will not go in depth on how the techniques work behind the scenes, since the subject is Information Systems Science. The datasets will be predefined, and the aim is to finally use the data warehouse and the daily incoming data.


2. Background

2.1 Truecaller
Truecaller is a Swedish mobile application company based in Stockholm. Its flagship application, Truecaller, is developed to provide users with information about wanted and unwanted calls from unknown numbers. By doing so, Truecaller helps users identify spam calls and messages, giving them the possibility to block those numbers and to share this information with the rest of the user base (Truecaller, 2016).

Truecaller today has over 200 million users and is steadily growing. The largest user group is located in India, where there are more than 100 million users. In India alone, Truecaller helps identify about 900 million calls and 120 million spam calls every month. The company stores over 2 billion names and numbers in its data warehouse, where they are later analysed for further actions. Working together with several telecom providers helps Truecaller reach new users in different countries (Choudury, 2015, October 12).

In India, where Truecaller is very popular, the number of unwanted/spam calls is a big problem. The government has special regulations concerning spam calls (Rukmini, 2014, December 13), and this can be one factor behind the popularity of the Truecaller application.

Truecaller stores over 1 billion daily events in its data warehouse, where this data is later used for analysis. These analyses can be based on user data, application performance, A/B testing etc. and are used for further development of the application.

2.2 Big Data
"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every two days." – Eric Schmidt, Software Engineer and Executive Chairman of Google (Walker, 2015)

2.2.1 What is Big Data?
The definition of Big Data is often discussed and is a popular and recurring subject within the IT world. Some choose to define "what" Big Data is, while others choose to look at what it "does" (Gandomi & Haider, 2015). The definition is therefore not crystal clear, and the "three V:s" (Volume, Variety and Velocity) have emerged as a way of describing Big Data.

Volume has to do with size and is often the first thing that comes to mind when talking about Big Data (see fig. 2). When talking about sizes in Big Data, multiple petabytes often come to mind (Walker, 2015). Keep in mind that these sizes increase all the time, even faster than Moore's law, which would mean a doubling in size every two years; the amount of data actually doubles about every 18 months (National Institute of Standards and Technology, 2015).

Variety concerns the varying types of data. The majority (over 90%) of the data is unstructured, due to the different types of data that are uploaded all the time through different channels, such as photos, videos, tweets etc. Structured data is stored in tables in relational databases; this is not the case for data stored in Hadoop.

Velocity refers to the speed at which the data is produced, stored and analysed (see fig. 1). Over the last years, as smartphones have become widely used, data velocity has exploded and the need for real-time analytics is in high demand. Using data from end users' smartphones, vendors can analyse data collected from the handheld devices, such as geospatial location, phone numbers, demographics etc. (Gandomi & Haider, 2015). These are the three "main" V:s, but over time additional V:s such as Variability, Value, Validity, Veracity etc. have entered the discussion on defining Big Data and data analytics (National Institute of Standards and Technology, 2015).

Fig 1. Exponential growth of data (Fishman, 2014, July 14).

Table 2. What is big data? (Cervone, 2015).

Here is an example of one of many definitions of Big Data: "Big Data consists of extensive datasets - primarily in the characteristics of volume, variety, velocity, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis." (National Institute of Standards and Technology, 2015)

So what "does" Big Data do, when is it used and by whom? We take a look at that in the next paragraph.

2.2.2 Data Analytics
Big Data is useless if it is just stored and never used. The main reason for storing all the data is to analyse it and somehow make use of it. With the help of data analysis, companies can make important decisions about their processes (Walker, 2015). The ways of analysing data are not so different from the past, using statistical models and formulas; the difference is mostly the size of the datasets and the speed of analysing larger amounts of data. Also, with the help of tools such as Hadoop, companies can focus more on causation because of the availability of data that can be provided. The use of causation helps us determine why something is happening, which leads to making better decisions (National Institute of Standards and Technology, 2015). According to Labrinidis and Jagadish (2012), as cited in Walker (2015), the process for extracting insights from data is divided into five stages that are part of two sub-processes, data management and analytics (see fig. 2). The first three stages belong to the more back-end data-management process and cover acquiring and storing the data, cleaning and preparing it, and then presenting/delivering it to the analysis process, where the data is actually analysed, interpreted and visualized (Gandomi & Haider, 2015).

Fig 2. The Big data process and sub processes (Gandomi & Haider, 2015).

2.2.3 Data Warehouse
A data warehouse is a dataset/database of data collected from one or multiple data sources, used to analyse historical data to help organisations with decision-making. In 1990 Bill Inmon introduced the term data warehouse, and in 1993 he stated that a data warehouse is defined as:

- Subject-oriented: the data focuses on specific subjects regarding the business, e.g. customers, products etc.
- Integrated: because the data often comes from multiple sources, it has to be restructured to follow a consistent pattern in the data warehouse.
- Time-variant: the data is fetched at different times and dates and has to be presented so that it makes sense for different time intervals and time-related matters.
- Non-volatile: the data grows incrementally and does not replace the old data but grows steadily over time (Connolly & Begg, 2005).

Data warehouses are often connected to OLAP (Online Analytical Processing) tools for analysis, mining and machine learning. The process in which a data warehouse retrieves new data from its data sources is called ETL (extract, transform, load). Data warehouses are often modelled using a star schema, which differs from a normal relational database model (Jukić, Sharma, Nestorov & Jukić, 2015). A star schema is a multidimensional model (see fig. 3) using fact tables that represent a business object, e.g. a purchase. The fact tables are connected to multiple dimension tables with specific grains that together make up the facts in the fact tables, e.g. a product dimension, a time dimension etc. Together they can show which product was purchased at what time and date etc. Data warehouses can be modelled with other schemas too, e.g. snowflake schema, OLAP cube etc. (Connolly & Begg, 2005).

Fig 3. Star schema (Connolly & Begg, 2005).
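To make the star schema concrete, a typical query joins the fact table to its dimension tables and aggregates over dimension attributes. The following is only an illustrative sketch with hypothetical table and column names (fact_purchase, dim_product, dim_time), not a model of Truecaller's actual warehouse:

    -- Hypothetical star-schema query: purchases summarised per product and month.
    -- fact_purchase is the fact table; dim_product and dim_time are dimension tables.
    SELECT p.product_name,
           t.calendar_year,
           t.calendar_month,
           SUM(f.amount) AS total_amount,
           COUNT(*)      AS purchases
    FROM fact_purchase f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_time    t ON f.time_key = t.time_key
    GROUP BY p.product_name, t.calendar_year, t.calendar_month
    ORDER BY total_amount DESC;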


2.3 Hadoop

2.3.1 What is Hadoop
Hadoop was created by Doug Cutting between 2003 and 2006. Cutting was working on an open source search engine called Nutch, where he was inspired by Google's technologies GFS (Google File System) and MapReduce, which together made up a platform for processing data efficiently on a large scale (White, 2010). From those technologies Cutting started developing Hadoop, which quickly became a prioritised project within the Apache open source foundation. Yahoo, who soon hired Cutting, strongly supported the project with help and resources. The name Hadoop comes from Cutting's son's yellow toy elephant, which is also why the Hadoop logo looks the way it does. By 2008, Hadoop was being used by several large companies like Amazon, Facebook etc. (Turkington, 2013).

Hadoop is a Java-based framework for distributed data computation and storage. Its linear scalability and cheap infrastructure make it easy to maintain and upgrade, which makes it the main actor for Big Data processing. Hadoop is made up of two main technologies, HDFS and MapReduce, which are completely different but complement each other. By using Hadoop, large sets of data can be computed in parallel on all the nodes of the cluster. This way, all computing power and resources can be utilised efficiently (White, 2010).

2.3.2 Truecaller and Hadoop
Truecaller stores all its data in Hadoop. Hadoop processes the data with MapReduce jobs running in parallel over a cluster of distributed nodes using HDFS (Hadoop Distributed File System). MapReduce is a programming model that splits the data processing in a query: the Map stage sorts and pairs the data in key-value format, and the data is then summarised and presented in the Reduce phase (see fig. 4).

Fig 4. Simplified MapReduce dataflow (White, 2010).

The main advantage of using Hadoop instead of a relational database for storing and analysing Big Data lies in cost and performance. Hadoop is relatively cheap because the nodes are normal commodity computers that interact through HDFS. It is also linearly scalable, which means that when more capacity or space is needed, adding new nodes to the cluster can easily solve the issue (White, 2010). When using an RDBMS, the data has to be stored on disk and in memory, which can be problematic when it comes to large amounts of data, e.g. petabytes. Spreading the workload over multiple nodes helps solve this issue (Chen et al., 2014). However, there are some other differences where Hadoop loses against the RDBMS (see table 3). The types of queries that are mainly supposed to be executed in Hadoop are MapReduce batch jobs. Hadoop was not meant to be used for interactive, on-the-fly ad-hoc queries or data discovery. RDBMSs are better for ad-hoc queries and provide a much more interactive user interface than Hadoop was intended to.

Table 3. RDBMS compared to MapReduce (White, 2010).

In recent years, many new SQL on Hadoop engines, which give the user an SQL interface and language to structure and query the data while using MapReduce, have been developed to support and provide ad-hoc querying. Hive is now more or less the industry-standard tool for SQL on Hadoop (White, 2010) (Chen et al., 2014). Hive is used as the main data warehouse tool at Truecaller.

2.3.3 MapReduce
MapReduce is a programming model used for processing big data distributed and in parallel across the Hadoop cluster. MapReduce is one of the two main parts of Hadoop and can be written in many different programming languages. As noted earlier, the MapReduce model consists of two different phases:

a. Map, which takes the input data, and
b. Reduce, which outputs the final result.

In the Map stage the input data is divided into splits, with one map task for each split. The map function then runs for each record in the split, and YARN, Hadoop's resource manager, schedules the tasks and divides them among the nodes in the cluster. If a job fails, YARN sees to it that it is re-run on another machine. Having the right split size helps the performance of the MapReduce job: when the splits are too big the job will be divided among fewer machines and will take longer to process, while if there are many nodes and the splits are small the job can be divided among many machines that process them in parallel (Maitrey & Jha, 2015). This is something the user has to set up correctly for the environment used. Normally, a split will be 128 MB, the same as the default size of an HDFS block. A block and a split are not the same thing. A block is a hard, pre-set division of the data. Because a record of data does not always fit in one block, or can overlap into another block, the MapReduce job uses splits to keep track of the different records to process. The number of splits for a MapReduce job can therefore not be predefined but is entirely dependent on the size of the data to be processed in the job (White, 2010).


For best results, Hadoop tries to run the map phase on the same node as where the input data is stored on HDFS. This way, network bandwidth is not wasted. If that node is busy running other map tasks, Hadoop chooses another node in the cluster holding the replicated data. This process is called data locality optimization.

As mentioned, for every input split there will be one mapper, and the mapper outputs key-value pairs as its result. These key-value outputs are written to the local disk on the node and not to HDFS. This is because the result of the map task is only for temporary use; once the final result of the reduce task is completed, that data is deleted from the local disk. The reduce phase does not have the data locality optimization, because a reducer receives data from multiple mappers that hold data from different nodes. After the map phase, the key-value pairs for every record are grouped by key into a list of values before being filtered to the right output in the reduce phase (Turkington, 2013). This is the procedure for a MapReduce job with only one reducer (see fig. 5). If dealing with a large set of data, it is advisable to use more than one reducer to avoid performance issues and bottlenecks.

The phase between map and reduce is called the shuffle phase. The shuffle phase is very important because it is where the key-value pairs are sorted and merged together, creating the key with the list of values that is used in the reduce phase before the final output (White, 2010). It comes into focus when jobs use more than one reducer (see fig. 6).

Fig 5. MapReduce flow with one reducer (White, 2010).


When using more than one reducer, the mappers partition the data, one partition for every reducer. This way, in the shuffle phase when the data is sorted and merged, the right key-value output goes to the right reducer. Without the partitioning, the shuffle phase would not know which reducer the data should be sent to, which would result in the same keys being spread out over different reducers, each producing its own output. The result would be processed ineffectively and would quite likely be wrong (Maitrey & Jha, 2015).

Fig 6. MapReduce data flow with multiple reducers (White, 2010).

2.3.4 HDFS
The Hadoop Distributed File System, or HDFS, is the second important part of Hadoop, together with MapReduce. HDFS is the file system used on a Hadoop cluster and is used for storing large sets of data on a cluster of distributed nodes. One of the key features of HDFS is that it runs on commodity hardware nodes, i.e. normally priced hardware that can be bought from multiple vendors and is not specially designed for Hadoop. HDFS is designed to keep running if a node fails, because that is something that happens frequently and must be accounted for.

Data processing is stream-based, which means write-once, read-many-times: the datasets used are written once and then used for processing over time, and therefore the ability to read the datasets is more important than low latency for the output results. The high throughput of the data comes at the cost of latency. The limit for the number of files stored in HDFS depends on the memory size of the master node(s), also known as the namenode. The namenode stores the metadata of the file system in its memory (location, type etc.) and is used to locate the files on the different worker nodes (White, 2010).


The file system works with namenodes and datanodes (master/slave), see fig. 7. Initially there was only one namenode in HDFS, but in later versions of Hadoop, multiple namenodes have been introduced for stability and off-loading purposes. Today there are normally a primary and a secondary namenode (the latter in standby mode). The namenode manages the file system tree and metadata and keeps track of all the files and directories on HDFS, which are stored on the rest of the nodes, the datanodes. The limitation for storing files in HDFS is the memory on the namenode, which holds the reference to all the files in the system. As long as the memory on the namenode is not full, nodes and disk space can be added to the cluster, which makes it scalable. The datanodes are the workers in the HDFS system, and all the data on HDFS is stored in blocks on the datanodes. As soon as a change occurs to data on a datanode that is found in the namespace, it is recorded on the namenode (Maitrey & Jha, 2015). Storing data in blocks has the advantage that files larger than the disks on the nodes do not become a problem, since a file can be distributed among multiple blocks in HDFS. The blocks are replicated and distributed over the cluster for backup and failure concerns. The default replication factor is 3, which means that a block will be replicated and spread three times over the cluster. This makes HDFS very reliable when a node fails or goes down (Apache, 2016) (Srinivasa & Mupalla, 2015).

Fig 7. HDFS Architecture (Srinivasa & Mupalla, 2015).

2.3.5 YARN
Apache YARN (Yet Another Resource Negotiator) is Hadoop's resource management system and was included with Hadoop 2.0. YARN is used both for resource management across the cluster and for node management, managing the resources of every single node by using containers. The containers are used for running applications on top of YARN and have their own allocated resources such as CPU and memory. Applications can be built on YARN using different APIs, and YARN is often not something the end user has to be concerned about. Applications that use YARN as an intermediate layer between themselves and the storage layer are called YARN applications (see fig. 8). This way YARN makes it easier to run applications on the cluster and sees to it that the resources are allocated correctly (White, 2010).

Fig 8. Applications running on YARN (White, 2010).

YARN consists of:

a. the resource manager,
b. the node manager, and
c. the application master.

The resource manager looks after the cluster's resources, while the node manager handles the nodes and containers. The application master (one per application) negotiates resources from the resource manager and, together with the node managers, helps the application run and compute its tasks.

The resource manager itself consists of:

a. the scheduler, and
b. the application manager.

The scheduler is the component that schedules and allocates the resources needed for running applications, based on what resources are needed and what is available in the containers. The application manager ensures that the application gets the right containers from the scheduler so that the application can run. While the application is running, it is also responsible for restarting it when failures occur and for tracking and monitoring the status of the application (see fig. 9) (Apache Hadoop, 2016).


Fig 9. YARN components (Apache Hadoop, 2016).

2.3.5.1 Slider
Slider is a YARN application developed to run non-YARN applications on YARN and to simplify this process. With Slider, different users, using different instances of an application, can run the application on YARN. Applications can also be stopped, killed, restarted etc. An example of a non-YARN application that can be run with Slider is the Hadoop query engine Presto (Apache Slider Incubator, 2016) (White, 2010).

2.3.6 SQL on Hadoop
SQL (Structured Query Language) has been used for analysing data in relational databases for a long time and is an industry-wide standard language within Business Intelligence too. In order to help users without software engineering skills to use Hadoop to analyse and query its data, different SQL on top of Hadoop solutions have been created. More and more competing solutions emerge every now and then, trying to give users the easiest and fastest solution on the market (Turkington, 2013). According to T. T. Maposa and M. Sethi (2015), the future of SQL on Hadoop is very bright because of the simplicity and wide user range of SQL and the power of HDFS, which together form a very sought-after solution (Maposa & Sethi, 2015).

2.3.6.1 Hive
Hive is an open source data warehousing solution for Hadoop and has an SQL-like querying language, HiveQL. Hive was developed at Facebook in 2007 and is a very broadly used tool that has become an industry-wide standard. Hive compiles the HiveQL statements into MapReduce jobs and returns the results to the user in an SQL-like manner. The HiveQL language has many of the standard SQL functions, and users can add customized functions (UDFs) for more functionality. Primitives, strings, arrays, maps and structs are among the data types supported in the tables, which is one of many reasons for Hive's popularity. Hive can be run both interactively, like a normal DBMS, e.g. MySQL, and through batch jobs in the background or on a schedule. The standard interfaces are the command line and a web GUI called Hue.

The data model of Hive is organised into tables, each with a corresponding directory in HDFS; partitions, which are subdirectories within the table directory; and buckets, which are the next level of partitioning within each partition, used for faster processing of data.
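To illustrate this data model, the hypothetical HiveQL statement below sketches a partitioned and bucketed table stored as ORC; the table and column names are invented for illustration and are not Truecaller's actual schema:

    -- Hypothetical HiveQL: a table partitioned by date and bucketed by user id.
    -- Each partition becomes a subdirectory under the table's HDFS directory,
    -- and each bucket becomes a file within that partition.
    CREATE TABLE events (
        user_id    BIGINT,
        event_type STRING,
        properties MAP<STRING, STRING>
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC;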

The architecture of Hive consists of (see fig. 10):

a. the client, for user interaction;
b. the HiveServer2 (Thrift) server, which interprets the different external languages against Hive;
c. the Metastore, the system catalogue that stores the metadata about all the databases, tables and partitions created in Hive;
d. the Driver, which handles the HiveQL statement during compilation and execution; every step to and from the client goes through the Driver;
e. the Compiler, which translates the HiveQL statement from the Driver into a plan that, simply speaking, consists of the MapReduce jobs that are submitted to the (Hadoop) execution engine (Thusoo et al., 2009) (White, 2010) (Turkington, 2013).


Fig 10. Hive Architecture (Thusoo et al., 2009).

2.3.6.2 Presto
Presto is an SQL engine that was also developed by Facebook; its main purpose is low-latency ad-hoc querying. Presto uses ANSI SQL syntax. It can run on Hadoop but also on many other database engines. When connected to Hadoop, Presto uses the Hive Metastore to find the data on HDFS. The main difference between Presto and Hive is that Presto does not process the queries using MapReduce but instead allocates the tasks to the workers' (nodes') memory and schedules new tasks to the workers once a task is finished (Teradata, 2016). This makes Presto more suitable for interactive querying, except in cases where the workers run out of memory because of the size of the data processed in a query. The architecture consists of a client that sends the query to a coordinator, which parses the SQL statement and plans and schedules the execution. The execution is then pipelined and distributed to the workers located closest to the data. The execution runs multiple stages simultaneously, and data is streamed between the stages using only memory, for low-latency output (Presto, 2016).


Fig 11. Presto Architecture (Teradata, 2016).
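Since Presto reads table definitions from the Hive Metastore, an existing Hive table can be queried directly with ANSI SQL through Presto's Hive connector. The sketch below assumes a hypothetical table events in the default schema of a catalog named hive; the names and filter values are illustrative only:

    -- Hypothetical ad-hoc Presto query against a Hive-managed table.
    -- hive.default refers to the Hive catalog and schema configured in Presto.
    SELECT event_type,
           count(*)                 AS events,
           approx_distinct(user_id) AS approx_users
    FROM hive.default.events
    WHERE event_date = '2016-05-01'
    GROUP BY event_type
    ORDER BY events DESC
    LIMIT 20;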


3. Literature review

3.1 Purpose of literature review
Many SQL on Hadoop benchmarks have been carried out to try to figure out which of the many projects to go for. Many of these benchmarks are done by specific vendors trying to market their own product. For that reason, they are not always trustworthy when looking to implement a solution at an organisation. The varying results often depend on the tested infrastructure, the architecture and the optimizations made for the specific solutions. Another factor is when in time the tests were conducted: the open source projects are in constant development to improve and become market leaders, and new versions of a solution will therefore be more optimized than the previous ones. This will become clearer after reading the results from the different benchmarking attempts presented below. Most of the engines tested in the benchmarks below will not be included in the benchmarking at Truecaller, but they show the diversity of results. I therefore think it is important to show the differences in the results and why a benchmark cannot be the only source for choosing and trusting a solution. One thing the benchmarks in this chapter have in common is that all the tested engines beat Hive in performance, because they do not use MapReduce for data processing but mostly process the data in memory. I have chosen only Presto from the many engines mentioned in this section, because it met all the requirements that Truecaller had.

3.2 Benchmarks

3.2.1 Fast-data-hackathon
The allegro.tech blog reports results from the fast-data-hackathon, where a group of users compared various SQL on Hadoop solutions on a 4-node test cluster in June 2015. They used 11 different queries in their tests (Allegro Tech, 2015, June 25). The tested solutions were:

- Hive on Tez
- Presto
- Impala
- Drill
- Spark


Fig 12. fast-data-hackathon results (Treasure Data Blog, 2015, March 20).

In the results above, Impala and Drill were the clear winners when it comes to execution time, but according to the group conducting the tests they both lacked some functionality that could be found in other solutions such as Hive and Presto.

3.2.2 Renmin University
In 2014, at Renmin University in Beijing, China, a group of researchers tested five different SQL on Hadoop solutions to compare them, motivated by Hive being the standard tool while not being efficient enough for interactive querying. They used 11 different queries of varying complexity on 25-, 50- and 100-node clusters. The solutions tested were:

- Impala
- Presto
- Hive (to show the difference)
- Stinger
- Shark


Fig 13. Renmin University results for 100-node cluster (Chen et al., 2014).

The results show not only the different response times, but also that some queries could not be run on any of the solutions except Hive. Based on the tests, the researchers concluded that none of the solutions was mature enough to be used on a daily basis (Chen et al., 2014).

3.2.3 Commercial benchmarks

3.2.3.1 Pivotal (HAWQ)
Pivotal, the developer of the SQL on Hadoop solution HAWQ, benchmarked their product against Cloudera's Impala, Presto and Stinger on an 8-node cluster in June 2015.


Fig 14. HAWQ's speed-up ratio vs. Impala (Soliman et al., 2014).

The results in the graph show how HAWQ beats Impala in the majority of the queries run in the test cases (Soliman et al., 2014) (Pivotal, 2014, June 25). It is worth highlighting that Impala had been the clear winner when it comes to speed in all the graphs above.

3.2.3.2 Cloudera (Impala)
In February 2016, Cloudera released their latest Impala benchmark, comparing Impala to Hive on Tez and to Spark on a 21-node cluster.

Fig 15. Cloudera’s Impala benchmark (Cloudera, 2016, February 11).


The graph above is one of many graphs on Cloudera's blog showing how Impala beats its competitors in every way (Cloudera, 2016, February 11). It is worth noting that this is a normal scenario when the results come from the vendor or developers trying to prove the efficiency of their own products.

3.3 Hive not meant for low-latency querying
Because Hive is based on HDFS and Hadoop, it is meant for batch jobs focusing on high throughput and scalability and can be used both for analysis and for ETL. Because of Hive's high fault tolerance and wide usability, it has become an industry-wide standard as a data warehouse tool for Hadoop (Turkington, 2013). It is therefore preferred for being reliable and stable, and other low-latency solutions are used on the side for ad-hoc querying (Cloudera, 2016).


4. Research methodology
This thesis is the start of a process to explore and test SQL on Hadoop query engines. The scope of this thesis is the evaluation of the query engine Presto. Presto will be evaluated against Hive, and the main focus will be on the latency of the use case queries. Secondly, the thesis will also evaluate the types of use cases Presto is suitable for, and identify and compare its pros and cons against the baseline solution, Hive.

The methodology used in this thesis is design science, following the design-science research process as interpreted by Peffers et al. (2007), since the steps involved are suited for identifying, implementing and evaluating the chosen solution. The main focus when using this methodology will be on the evaluation of the solution rather than on building an artefact; the focus is on deploying and configuring Presto, which is an already developed open source query engine. Defining the right use cases will also be a big part of the design.

The steps in the design-science research process are:

1. Identify problem & Motivate
2. Define Objectives for solution
3. Design & Development
4. Demonstration
5. Evaluation
6. Communication

Fig. 16 Design Science Research Methodology Peffers, Tuunanen, Rothenberger & Chatterjee, 2007).

4.1 Identify problems & Motivate
The first step in the design-science research process is to identify and motivate the problems. At Truecaller, the main problem is the latency and execution time of Hive when running ad-hoc queries.

Problems:

1. Hive/MapReduce is too slow for data discovery / ad-hoc queries.


As noted earlier, the problem with the current setup at Truecaller is that the data warehouse is accessed through Hive, using MapReduce on Hadoop. Running complex queries often takes unnecessarily long, which is bad both for the analytics team and for the stakeholders who request ad-hoc analyses on a regular basis. MapReduce is better suited for batch jobs that can be scheduled for standard analysis than for ad-hoc querying and data discovery. Hive and MapReduce take time because the processing writes its output to disk.

4.2 Define Objectives for solution
The objectives for the solution will in this case be the requirements for identifying the new solution.

Requirements:

1. Open source
2. SQL interface
3. Support for the ORC storage file type
4. Support for the same data types as Hive, e.g. structs, arrays, maps
5. UDF/UDA (User-Defined Functions/Aggregations)
6. Standard analytical functions
7. Window functions

Table 4. Requirements

Before trying to identify the different solutions, discussions will be held with the end users in the analytics team to gather requirements regarding the solution. There are many open source solutions that can be tested, and with the right requirements the list can be narrowed down.

Here the use cases will be defined, based on everyday ad-hoc analysis carried out by the analytics team. The use cases will incrementally become more complex, in order to compare the time and resources used by the solutions with each other. An initial baseline where the use cases are run in Hive will be the second step, so that the solutions can be compared with the current situation. All the use cases will be tested in solutions supporting SQL syntax, and the queries will be the same or vary only in native syntax. (E.g. HiveQL does not support all standard SQL functions, and SQL can also vary depending on the engine/dialect; a small example of such a dialect difference is sketched below.)
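As a hedged illustration of such a dialect difference (the table and columns here are hypothetical): expanding an array column into one row per element is expressed differently in HiveQL and in Presto's ANSI SQL, even though the logic is the same.

    -- HiveQL: expand an array column into one row per element.
    SELECT user_id, tag
    FROM events
    LATERAL VIEW explode(tags) t AS tag;

    -- Presto: the same expansion using ANSI-style UNNEST.
    SELECT user_id, tag
    FROM events
    CROSS JOIN UNNEST(tags) AS t (tag);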

4.3 Design & Development
In this step, designing and developing an artefact is the common practice in design science. In this case, researching and identifying the solution is the pre-step before deploying the solution, first in a test environment and then in the production environment (the Hadoop cluster).

Presto was the chosen solution because it met all the requirements mentioned above and was easy enough to deploy using the YARN application version.

The first step will be to test the solution in a smaller test environment before implementing it in the production environment. For Presto, the test environment will be a Docker container, which is a type of virtual environment. The container will have a one-node Hadoop cluster installed and ready to test on. The test setup consisted of installing the solution, importing a dataset and test-running some random queries.
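The kind of sanity checks run in the test container could look like the hypothetical statements below, which only verify that Presto sees the Hive Metastore and can read rows from an imported test table (test_events is an assumed name):

    -- List the tables Presto sees through the Hive connector.
    SHOW TABLES FROM hive.default;

    -- Read a handful of rows from an imported test table.
    SELECT *
    FROM hive.default.test_events
    LIMIT 10;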

4.4 Demonstration
This step prepares the evaluation of the solution; in this case, defining the use cases for the evaluation is the main goal.

The implementation of the solution will not be heavily documented, even though a big part of the time in this project will be spent on deploying and configuring the solution for the current environment at Truecaller. The testing in the production environment will be based on the pre-defined use cases. This step will show how the solution differs from Hive and what the pros and cons are for the particular queries. The use cases will be run 4 times: 2 times during work hours, when there is normal load on the cluster, and 2 times during the evening. The average time in seconds will be used to report the latency.

4.5 Evaluation
This step evaluates the results of the solution.

The results of the use cases will be recorded and visualized using bar charts to show the difference in latency between the solutions. The result chapter of the thesis focuses mainly on this step of the methodology. Other conclusions will also be drawn from the baseline and the new solution, e.g. pros and cons and suitable use cases.

4.6 Communication
The thesis itself represents this step. Together with the thesis, the conclusions and the recommendations for further testing at Truecaller, a presentation will be held at LTU's campus for supervisors and other students.


5. Result

5.1 Cluster details
The Hadoop cluster at Truecaller consists of 14 nodes with the following components:

OS: Linux (Debian kernel) 3.2.0.4-amd64
CPU model: Intel(R) Xeon(R) [email protected]
CPUs: 32
RAM: 128 GB
HDD: ~20 TB

Table 5. Cluster details

5.2 Implementation
Presto was deployed on Truecaller's Hadoop cluster using a version that allows it to run as a YARN application, using all 14 nodes. One node was shared as both Coordinator and Worker, and the rest ran as Workers. The whole setup and configuration can be found in appendix 9.1, Presto installation.

5.3 Use cases and latency in seconds
All use cases were run against a table of 14 billion rows. The exact queries are listed in Appendix B (9.2); an illustrative sketch of some of the query shapes follows the list.

1. SELECT * with restriction and limitation, type 1.
Hive: ~20 s. Presto: ~5 s. Conclusion: Presto ~4x faster.

2. SELECT * with restriction and limitation, type 2.
Hive: ~40 s. Presto: ~7 s. Conclusion: Presto ~6x faster.

3. SELECT * with restriction and limitation, type 3.
Hive: ~200 s. Presto: ~15 s. Conclusion: Presto ~13x faster.

4. SELECT count with restriction.
Hive: ~90 s. Presto: ~23 s. Conclusion: Presto ~4x faster.

5. SELECT * with restriction, limitation and ORDER BY.
Hive: ~340 s. Presto: runs out of memory because of the ORDER BY (too expensive). Conclusion: not a suitable use case for Presto.

6. SELECT with 2 aggregate functions.
Hive: ~600 s. Presto: ~120 s. Conclusion: Presto ~5x faster.

7. SELECT with aggregates, GROUP BY and limitation, type 1.
Hive: ~120 s. Presto: ~40 s. Conclusion: Presto ~3x faster.

8. SELECT with aggregates, GROUP BY and limitation, type 2.
Hive: ~120 s. Presto: ~50 s. Conclusion: Presto ~2.5x faster.

9. SELECT with multiple aggregates, GROUP BY, limitations and a table joined 3 times.
Hive: ~6000 s. Presto: ~7000 s. Conclusion: Hive ~1.2x faster.
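For orientation only, the shapes of use cases 1, 4 and 7 might look roughly like the hypothetical queries below; the actual queries are listed in Appendix B, and "restriction" and "limitation" are assumed here to mean a WHERE filter and a LIMIT clause:

    -- Use case 1 (sketch): SELECT * with a restriction and a limitation.
    SELECT *
    FROM events
    WHERE country_code = 'IN'
    LIMIT 100;

    -- Use case 4 (sketch): count with a restriction.
    SELECT count(*)
    FROM events
    WHERE event_date = '2016-05-01';

    -- Use case 7 (sketch): aggregates with GROUP BY and a limitation.
    SELECT event_type, count(*) AS events_cnt, avg(duration) AS avg_duration
    FROM events
    WHERE event_date = '2016-05-01'
    GROUP BY event_type
    LIMIT 50;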


5.4 Result table

Fig 17. Query execution time in seconds (bar chart comparing Hive and Presto for queries q1–q9; the underlying values from section 5.3 are summarised below).

Query   Hive (sec)   Presto (sec)
q1      20           5
q2      40           7
q3      200          15
q4      90           23
q5      340          out of memory
q6      600          120
q7      120          40
q8      120          50
q9      6000         7000


6. Discussion

6.1 Presto at truecaller

Looking at the queries where Presto performed faster than Hive, the average gain in latency is around 5x. This is somewhat lower than some of the example benchmarks that can be found online, but it is still a big gain when queries need to be run over and over again during daily work. Presto clearly showed that it is best suited for returning fast insights on a smaller scale and for ad-hoc analysis by someone who is familiar with the data sets and knows what to look for and where. It is less suited for more expensive queries, where the workers need to hold large amounts of intermediate data and risk running out of memory. In the configuration and testing, allocating 70 GB of maximum query memory was not enough for query 5, which contained an ORDER BY operation and ran out of memory. Presto is also not suited for multiple joins between large tables, where it performed slower than Hive.

In truecaller's case, Presto works well for many of the common day-to-day ad-hoc use cases. The analytics team at truecaller already uses multiple tools, and knowing the pros and cons of each tool helps the users choose which one to go for. Spark is already being used for some ETL jobs, and therefore the team will not suffer from adding another tool to their toolbox. Presto has some minor syntax differences compared to Hive, but they are easy to learn and should not be seen as an obstacle.

Another finding from this research is that Presto can be configured and tuned at a very detailed level. The results presented in the graph were obtained after numerous iterations of configuration and tuning. There is also the standard version, which requires more configuration at the operating-system level; the version tested at truecaller was the YARN application version. There is therefore more room for tuning and testing Presto before choosing the final settings for truecaller's Hadoop cluster and environment. My recommendation would be, after tuning and configuring Presto, to use Presto for the more lightweight queries together with Hive for the bigger jobs (ETL, batch etc.).
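To illustrate the kind of syntax differences mentioned above, the minimal sketch below contrasts a Hive expression with a possible Presto equivalent. It is a hypothetical example: the column names are borrowed from the use-case queries in Appendix B, and the exact rewrites needed depend on the Presto version in use.

-- HiveQL: 0-based array indexing, size() for array length, SORT BY supported
SELECT AVG(appsearchv2.attempts[0].latency) AS avg_latency
FROM data.app_events
WHERE size(appsearchv2.attempts) > 0
  AND dt = 20160221;

-- Presto SQL: 1-based array indexing, cardinality() instead of size(),
-- and no SORT BY clause (only ORDER BY)
SELECT AVG(appsearchv2.attempts[1].latency) AS avg_latency
FROM data.app_events
WHERE cardinality(appsearchv2.attempts) > 0
  AND dt = 20160221;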

6.2 Use cases

The use cases chosen for the evaluation are based on work I did for truecaller alongside the thesis work. The queries used are actual ad-hoc requests to the analytics team, which I took on in order to learn more about the data that is analysed on a daily basis. That way I got a good picture of what types of queries are often run today and will be run in a new solution if it is implemented.

6.3 Limitations

The goal from the beginning was to evaluate two or three solutions and compare them to Hive in the same way as Presto. Due to resource limitations and the complexity of the project, I could not proceed with testing more solutions. This had mostly to do with not risking the daily work by trying to implement new solutions on the production cluster without senior supervision.

6.4 Commercial alternatives

The reason to test open source solutions was, as mentioned earlier, to keep costs down. There are otherwise commercial alternatives that perform very well and run stably with Hadoop. An example is Exasol, an in-memory analytics database. I have tested and worked with Exasol in other projects and would recommend it as a low-latency solution for companies that can consider paying for such a solution.


6.5 Further research

Tuning and configuring Presto with professional help would be a first step to see how much faster Presto could become. It would also be interesting to see whether there are any gains for heavier queries and jobs compared to the current configuration at truecaller. A next step would be to see whether gains in execution time are possible by tuning the parameters of the ORC files. This was another intention of this research, but it could not be done due to time limitations.
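As a purely illustrative sketch of what such ORC tuning could look like, Hive allows ORC parameters to be set per table via TBLPROPERTIES. The table name and parameter values below are hypothetical and would have to be validated against truecaller's data and workload.

-- hypothetical example: rewrite one partition into an ORC table with tuned parameters
CREATE TABLE app_events_orc_tuned
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',
  'orc.stripe.size' = '67108864',     -- 64 MB stripes
  'orc.row.index.stride' = '10000',
  'orc.create.index' = 'true'
)
AS
SELECT * FROM data.app_events
WHERE dt = 20160221;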

Apache Drill and Apache Tajo are the other two solutions that were meant to be tested and evaluated in this thesis work. Tajo was actually implemented but did not support some of the required file types and data types. I have been in close contact with the developers of Tajo, and it should meet all of truecaller's requirements in the next release, which was initially planned for some time in April 2016 but was delayed. After the release, and if there is time and resources, looking into both of these solutions could be of interest for truecaller.

If there are additional resources and time, I would recommend testing Impala by changing the file format from ORC to Parquet and running the same type of tests. Based on the results of most of the benchmarks read during this research, Impala has proven to be a strong candidate for a lightweight SQL on Hadoop tool. The only problem is the difference in file formats (ORC vs. Parquet) compared to what truecaller has chosen for storage.
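As a minimal sketch of such a test setup (hypothetical table name, and ignoring partitioning and the full schema for brevity), one day of data could be materialised as Parquet with Hive before benchmarking Impala against it:

-- hypothetical: copy one partition of app_events into a Parquet table for an Impala test
CREATE TABLE app_events_parquet
STORED AS PARQUET
AS
SELECT * FROM data.app_events
WHERE dt = 20160221;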


7. Conclusion

There are many SQL on Hadoop solutions on the market today. Presto is one of the better-known open source SQL on Hadoop solutions and was chosen as the tool to be tested after meeting all the requirements. Since only Presto was tested, due to the limitations, it cannot be said that it is the only and/or perfect solution for truecaller's case. However, it has answered the research questions of this thesis by proving to be on average 5 times faster than Hive. Presto was shown to work best for lighter ad-hoc analysis running on smaller data sets, or on sample data taken from larger sets to prove a point. Presto did not work well with larger data sets and heavier queries containing multiple joins or self joins. Presto can be tuned and optimized to fit the particular environment and perform at its best. The use cases and all testing were done at truecaller in Stockholm, and the results all depend on the configuration and hardware of truecaller's Hadoop cluster. Presto is now in use by the analytics team at truecaller for specific ad-hoc use cases and will not permanently replace Hive or any other tool at the company. Users who know the data will be able to make the right decision about when to use Presto and when not to. Using Presto for the right use cases will save truecaller's analytics team time and add efficiency to their work, which was the main purpose of testing another solution. By deploying Presto at truecaller, this goal has been reached for now.


8. References

8.1 Books and articles

• White, T. (2010). Hadoop: The Definitive Guide (2nd ed.). Farnham: O'Reilly.
• Chen, Y., Qin, X., Bian, H., Chen, J., Dong, Z., Du, X., & Zhang, H. (2014). A Study of SQL-on-Hadoop Systems. Lecture Notes in Computer Science, (8807), 154-166.
• Walker, R. (2015). From Big Data to Big Profits: Success with Data and Analytics. Oxford University Press.
• Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35, 137-144. doi:10.1016/j.ijinfomgt.2014.10.007
• National Institute of Standards and Technology. (2015). Big Data Interoperability Framework: Volume 1, Definitions. U.S. Department of Commerce.
• Turkington, G. (2013). Hadoop Beginner's Guide. Birmingham: Packt Publishing.
• Cervone, F. (2015). Big Data and Analytics. Online Searcher, 39(6), 22-27.
• Maitrey, S., & Jha, C. (2015). MapReduce: Simplified Data Analysis of Big Data. Procedia Computer Science, 57 (3rd International Conference on Recent Trends in Computing 2015 (ICRTC-2015)), 563-571. doi:10.1016/j.procs.2015.07.392
• Jukić, N., Sharma, A., Nestorov, S., & Jukić, B. (2015). Augmenting Data Warehouses with Big Data. Information Systems Management, 32(3), 200-209. doi:10.1080/10580530.2015.1044338
• Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., & ... Murthy, R. (2009). Hive - A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2), 1626-1629.
• Connolly, T. M., & Begg, C. E. (2005). Database Systems: A Practical Approach to Design, Implementation and Management (4th rev. ed.). Harlow: Addison-Wesley.
• Maposa, T. T., & Sethi, M. (2015). SQL-on-Hadoop: The Most Probable Future in Big Data Analytics. Advances in Computer Science and Information Technology (ACSIT), 2(13), 9-14.
• Srinivasa, K. G., & Mupalla, A. K. (2015). Guide to High Performance Distributed Computing: Case Studies with Hadoop, Scalding and Spark. Springer.
• Soliman, M., Petropoulos, M., Waas, F., Narayanan, S., Krikellas, K., & Baldwin, R. et al. (2014). Orca. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data - SIGMOD '14, 337-348. http://dx.doi.org/10.1145/2588555.2595637
• Choudhury, K. (2015, October 12). Truecaller's India user base grows to 100 mn. Business Standard, New Delhi. Retrieved from http://www.business-standard.com/article/companies/truecaller-s-india-user-base-grows-to-100-mn-115101200037_1.html
• Rukmini, S. (2014, December 13). India losing war on spam. The Hindu. Retrieved from http://www.thehindu.com/sci-tech/technology/10-lakh-complaints-about-spam-calls-in-the-last-three-years/article6689376.ece
• Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A Design Science Research Methodology for Information Systems Research. Journal of Management Information Systems, 24(3), 45-77.


8.2 Websites and blog posts

• Truecaller. (2016). Making Your Phone Smarter. Retrieved from https://www.truecaller.com/about
• Apache. (2016). Hadoop Wiki, Project Description. Retrieved from http://wiki.apache.org/hadoop/ProjectDescription
• Apache Hadoop. (2016). Apache Hadoop YARN. Retrieved from http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
• Apache Slider Incubator. (2016). Apache Slider Architecture. Retrieved from https://slider.incubator.apache.org/design/architecture.html
• Teradata. (2016). Presto, a Faster, More Scalable, More Flexible Open Source SQL on Hadoop Engine. Retrieved from http://assets.teradata.com/resourceCenter/downloads/Datasheets/EB8901.pdf?processed=1
• Presto. (2016). Overview. Retrieved from https://prestodb.io/overview.html
• Cloudera. (2016). What is Hive? Retrieved from https://archive.cloudera.com/cdh4/cdh/4/hive/
• Allegro Tech. (2015, June 10). Fast Data Hackathon [Blog post]. Retrieved from http://allegro.tech/2015/06/fast-data-hackathon.html
• Pivotal. (2014, June 25). Pivotal HAWQ Benchmark Demonstrates Up To 21x Faster Performance on Hadoop Queries Than SQL-like Solutions [Blog post]. Retrieved from https://blog.pivotal.io/big-data-pivotal/products/pivotal-hawq-benchmark-demonstrates-up-to-21x-faster-performance-on-hadoop-queries-than-sql-like-solutions
• Cloudera. (2016, February 11). New SQL Benchmarks: Apache Impala (incubating) Uniquely Delivers Analytic Database Performance [Blog post]. Retrieved from https://blog.cloudera.com/blog/2016/02/new-sql-benchmarks-apache-impala-incubating-2-3-uniquely-delivers-analytic-database-performance/
• Fishman, J. (2014, July 14). Apple Solar Frenzy: Benefiting Shareholders and the Planet Alike [Blog post]. Retrieved from http://solarstockideas.com/apple-solar-frenzy-benefiting-shareholders-planet-alike/


9. Appendices

9.1 Appendix A – Presto installation

Guide to installing Presto on a Hadoop cluster.

On a Docker container with a Cloudera single-node cluster, for testing.

Prerequisites:
• HDFS
• YARN
• ZooKeeper
• Java
• Maven

Step 1. Install Slider to run the Presto application on YARN:

1. Create a yarn user on the container's HDFS:

hdfs dfs -mkdir -p /user/yarn
hdfs dfs -chown yarn:yarn /user/yarn

2. Download Slider 0.80.0 for YARN (http://apache.mirrors.spacedump.net/incubator/slider/0.80.0-incubating/slider-assembly-0.80.0-incubating-all.tar.gz).

3. Untar the file with the command: tar -xvf slider-0.80.0-incubating-all.tar.gz

4. Go to the slider-assembly/target directory and copy the 0.80.0-incubating-all.zip to a new folder where you want to install Slider.

5. Unzip the file in the folder; the slider folder will now be created.

6. Go to the slider folder /conf, open the file slider-env.sh and set the environment variables:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export HADOOP_CONF_DIR=/etc/hadoop/conf

7. Open the file slider-client.xml and set the values:

<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>localhost:2181</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>localhost:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>localhost:8030</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
</property>

8. Test Slider with the command:

${slider-install-dir}/slider-0.80.0-incubating/bin/slider version


If you get the output "Compiled against Hadoop 2.6.0", Slider is installed correctly.

Step 2. Install Presto:

1. Download the presto-yarn package from GitHub: git clone https://github.com/prestodb/presto-yarn.git

2. Build the package with Maven: mvn clean package

3. Go to the created folder, then to presto-yarn-package/target, and copy presto-yarn-package-1.0.0-SNAPSHOT-0.130.zip into the slider folder.

4. Go back to the presto-yarn-package folder, then to /src/main/resources, and copy the files appConfig.json and resources-singlenode.json into the slider folder.

5. In the slider folder, rename resources-singlenode.json to resources.json, open it and edit the settings:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {
    "yarn.vcores": "1"
  },
  "components": {
    "slider-appmaster": {},
    "COORDINATOR": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.memory": "1500"
    }
  }
}

6. Open appConfig.json and edit the settings:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {
    "site.global.app_user": "yarn",
    "site.global.user_group": "hadoop",
    "site.global.data_dir": "/var/lib/presto/data",
    "site.global.config_dir": "/var/lib/presto/etc",
    "site.global.app_name": "presto-server-0.130",
    "site.global.app_pkg_plugin": "${AGENT_WORK_ROOT}/app/definition/package/plugins/",
    "site.global.singlenode": "true",
    "site.global.coordinator_host": "${COORDINATOR_HOST}",
    "site.global.presto_query_max_memory": "50GB",
    "site.global.presto_query_max_memory_per_node": "512MB",
    "site.global.presto_server_port": "8080",
    "site.global.catalog": "{'hive': ['connector.name=hive-cdh5', 'hive.metastore.uri=thrift://${NN_HOST}:9083'], 'tpch': ['connector.name=tpch']}",
    "site.global.jvm_args": "['-server', '-Xmx1024M', '-XX:+UseG1GC', '-XX:G1HeapRegionSize=32M', '-XX:+UseGCOverheadLimit', '-XX:+ExplicitGCInvokesConcurrent', '-XX:+HeapDumpOnOutOfMemoryError', '-XX:OnOutOfMemoryError=kill -9 %p']",
    "application.def": ".slider/package/PRESTO/presto-yarn-package-1.1-SNAPSHOT-0.130.zip",
    "java_home": "/usr/lib/jvm/java-8-oracle/"
  },
  "components": {
    "slider-appmaster": {
      "jvm.heapsize": "128M"
    }
  }
}

Test different settings for your environment; a per-node query memory of 1024MB was too much for us, so we changed it to "site.global.presto_query_max_memory_per_node": "512MB".

7. In the slider folder, change user to yarn (su yarn) and run the command to install Presto:

bin/slider package --install --name PRESTO --package presto-yarn-package-*.zip

8. Run Presto with the command:

bin/slider create presto1 --template appConfig.json --resources resources.json

If the command exits with status 0, the Presto application should now be running.

9. Go to the YARN ResourceManager web UI (http://science2.truecaller.net:8088) to check that Presto is running (the application presto1 should be shown as running in the first row).

10. Create a new folder and download the Presto client jar file (https://repository.sonatype.org/service/local/artifact/maven/content?r=central-proxy&g=com.facebook.presto&a=presto-cli&v=RELEASE).

11. Rename the jar file to just presto (mv *.jar presto) and run the command to make it executable:

chmod +x presto

12. To test Presto, go to the Hive client and create a table with data. Now go back to the folder with presto as the yarn user (su yarn) and type the command to run Presto:

./presto --server localhost:8080 --catalog system --schema default (or whichever schema you want to use)

Type show tables; if the table is listed, try out a query.

On the hdp3 cluster:


Step 1. Install Slider to run the Presto application on YARN:

1. Create a yarn user on the cluster's HDFS:

hdfs dfs -mkdir -p /user/yarn
hdfs dfs -chown yarn:yarn /user/yarn

2. Download Slider 0.80.0 for YARN (http://apache.mirrors.spacedump.net/incubator/slider/0.80.0-incubating/slider-assembly-0.80.0-incubating-all.tar.gz).

3. Untar the file with the command: tar -xvf slider-0.80.0-incubating-all.tar.gz

4. Go to the slider-assembly/target directory and copy the 0.80.0-incubating-all.zip to a new folder where you want to install Slider.

5. Unzip the file in the folder; the slider folder will now be created.

6. Go to the slider folder /conf, open the file slider-env.sh and set the environment variables:

export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-x64/
export HADOOP_CONF_DIR=/etc/hadoop/conf

7. Open the file slider-client.xml and set the values:

<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>----------------</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>-----</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value> </value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs---8020</value>
</property>

8. Test Slider with the command:

${slider-install-dir}/slider-0.80.0-incubating/bin/slider version

If you get the output "Compiled against Hadoop 2.6.0", Slider is installed correctly.

Step 2. Install Presto:

1. Download the presto-yarn package from GitHub: git clone https://github.com/prestodb/presto-yarn.git

2. Build the package with Maven: mvn clean package

3. Go to the created folder, then to presto-yarn-package/target, and copy presto-yarn-package-1.0.0-SNAPSHOT-0.130.zip into the slider folder.

4. Go back to the presto-yarn-package folder, then to /src/main/resources, and copy the files appConfig.json and resources-singlenode.json into the slider folder.

5. In the slider folder, rename resources-singlenode.json to resources.json, open it and edit the settings:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {
    "yarn.vcores": "8"
  },
  "components": {
    "slider-appmaster": {},
    "COORDINATOR": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.component.placement.policy": "1",
      "yarn.memory": "4000"
    },
    "WORKER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "13",
      "yarn.component.placement.policy": "1",
      "yarn.memory": "8000"
    }
  }
}

6. Open appConfig.json and edit the settings:

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {
    "site.global.app_user": "yarn",
    "site.global.user_group": "hadoop",
    "site.global.data_dir": "/var/lib/presto/data",
    "site.global.config_dir": "/var/lib/presto/etc",
    "site.global.app_name": "presto-server-0.142",
    "site.global.app_pkg_plugin": "${AGENT_WORK_ROOT}/app/definition/package/plugins/",
    "site.global.singlenode": "true",
    "site.global.coordinator_host": "${COORDINATOR_HOST}",
    "site.global.presto_query_max_memory": "50GB",
    "site.global.presto_query_max_memory_per_node": "4GB",
    "site.global.presto_server_port": "8080",
    "site.global.catalog": "{'hive': ['connector.name=hive-cdh5', 'hive.metastore.uri=thrift://hdp3.truecaller.net:9083'], 'tpch': ['connector.name=tpch']}",
    "site.global.jvm_args": "['-server', '-Xmx8192M', '-XX:+UseG1GC', '-XX:G1HeapRegionSize=32M', '-XX:+UseGCOverheadLimit', '-XX:+ExplicitGCInvokesConcurrent', '-XX:+HeapDumpOnOutOfMemoryError', '-XX:OnOutOfMemoryError=kill -9 %p']",
    "application.def": ".slider/package/PRESTO/presto-yarn-package-1.1-SNAPSHOT-0.142.zip",
    "java_home": "/usr/lib/jvm/jdk-8-oracle-x64/"
  },
  "components": {
    "slider-appmaster": {
      "jvm.heapsize": "128M"
    }
  }
}

7. In the slider folder, change user to yarn (su yarn) and run the command to install Presto:

bin/slider package --install --name PRESTO --package presto-yarn-package-*.zip

8. Run Presto with the command:

bin/slider create presto1 --template appConfig.json --resources resources.json

If the command exits with status 0, the Presto application should now be running.

9. Go to the YARN ResourceManager web UI to check that Presto is running (the application presto1 should be shown as running in the first row).

10. Create a new folder and download the Presto client jar file (https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.142/presto-cli-0.142-executable.jar).

11. Rename the jar file to just presto (mv *.jar presto) and run the command to make it executable:

chmod +x presto

12. To test Presto, go to the Hive client and create a table with data. Go to the ResourceManager web interface, find presto1 under running applications and find the URL of the coordinator node.

13. Now go back to the folder with presto as the yarn user (su yarn) and type the command to run Presto:

./presto --server [URL to coordinator]:8080 --catalog hive --schema default (or whichever schema you want to use)

Type show tables; if the table is listed, try out a query.

9.2 Appendix B – Use case queries

1. SELECT * with restriction AND LIMIT 1:

SELECT *
FROM app_events
WHERE UPPER(user.country_code) = 'IN'
AND dt = 20160221
AND h = 02


LIMIT 10000;

2. SELECT * with restriction AND LIMIT 2:

SELECT *
FROM app_events
WHERE dt = 20160221
AND appsearch.type IN ('1', '2')
LIMIT 10000;

3. SELECT * with restriction AND LIMIT 3:

SELECT * FROM app_events
WHERE user.register_id = 2107422
AND appsearch.type = '2'
AND dt BETWEEN 20160123 AND 20160125
LIMIT 50;

4. SELECT COUNT(*) with restriction:

SELECT COUNT(*) FROM app_events
WHERE upper(user.country_code) = 'IN'
AND appsearch.type = '2'
AND dt = 20160123;

5. SELECT * with restriction, LIMIT AND ORDER BY:

SELECT *
FROM app_events
WHERE dt = 20160221


AND user.country_code IS NOT NULL
AND appsearch.type IN ('1', '2')
ORDER BY UPPER(user.country_code)
LIMIT 10000;

6. SELECT * with restriction, LIMIT AND SORT BY:

SELECT *
FROM app_events
WHERE dt = 20160221
AND h = 02
AND user.country_code IS NOT NULL
AND appsearch.type IN ('1', '2')
SORT BY UPPER(user.country_code);

7. Simple SELECT with aggregates:

SELECT
SUM(appsearchv2.attempts[0].latency) sum_latency,
AVG(appsearchv2.attempts[0].latency) avg_latency
FROM data.app_events
WHERE size(appsearchv2.attempts) > 0
AND dt BETWEEN 20160130 AND 20160202
AND upper(user.country_code) = 'IN'
AND appsearchv2.attempts[0].endpoint NOT LIKE '%localhost%';

8. SELECT with aggregates AND GROUP BY:

SELECT user.register_id,
appsearch.searchrequest.normalizedphonenumber AS incoming_nr,


SUM(if(appsearch.type = 2, 1, 0)) incoming_calls,
SUM(if(appsearch.type = 5, 1, 0)) sms
FROM data.app_events
WHERE dt = 20160121
AND appsearch.searchrequest.searchcountrycode = 'SE'
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY user.register_id,
appsearch.searchrequest.normalizedphonenumber
LIMIT 10000;

9. SELECT with aggregates, GROUP BY AND ORDER BY:

SELECT user.register_id,
appsearch.searchrequest.normalizedphonenumber AS incoming_nr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls,
SUM(if(appsearch.type = 5, 1, 0)) sms
FROM data.app_events
WHERE dt = 20160121
AND appsearch.searchrequest.searchcountrycode = 'SE'
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY user.register_id,
appsearch.searchrequest.normalizedphonenumber
ORDER BY user.register_id,
incoming_nr
LIMIT 10000;

10. SELECT FROM inner query with inner GROUP BY AND outer ORDER BY:


SELECT register_id,
incoming_nr,
incoming_calls,
sms
FROM (
SELECT user.register_id,
appsearch.searchrequest.normalizedphonenumber AS incoming_nr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls,
SUM(if(appsearch.type = 5, 1, 0)) sms
FROM data.app_events
WHERE dt = 20160125
AND appsearch.searchrequest.searchcountrycode = 'SE'
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY user.register_id,
appsearch.searchrequest.normalizedphonenumber) x
WHERE incoming_calls > 0 OR sms > 0
ORDER BY register_id
LIMIT 10000;

11. JOIN WITH 2 INNER QUERIES WITH AGGREGATES AND ORDER BY:

SELECT
a.city,
COUNT(DISTINCT a.registerid) nr_IOS_profiles
FROM (
SELECT
LOWER(city) city,
registerid
FROM data.user_profile_data
WHERE LOWER(city) <> ''


AND dt = 20160221
AND registerid <> 0) a
INNER JOIN (
SELECT
id
FROM userprimaryserviceprovider
WHERE LOWER(os) LIKE 'iphone%'
AND dt = 20160221) b
ON b.id = a.registerid
GROUP BY a.city
ORDER BY nr_IOS_profiles DESC
LIMIT 10;

12. 3 joins with aggregates AND SORT BY:

SELECT
FROM_UNIXTIME(UNIX_TIMESTAMP() - 86400, 'yyyyMMdd') AS dt,
a.countrycode,
a.incoming_phonenr,
COALESCE(c.incoming_calls_24h, 0L) incoming_calls_24h,
COALESCE(c.sms_24h, 0L) sms_24h,
COALESCE(b.incoming_calls_3d, 0L) incoming_calls_3d,
COALESCE(b.sms_3d, 0L) sms_3d,
a.incoming_calls_7d,
a.sms_7d
FROM
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_7d,


sms_7d
FROM (
SELECT
appsearch.searchrequest.searchcountrycode AS countrycode,
appsearch.searchrequest.normalizedphonenumber AS incoming_phonenr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls_7d,
SUM(if(appsearch.type = 5, 1, 0)) sms_7d
FROM data.app_events
WHERE dt >= 20160101 AND dt <= 20160107
AND upper(appsearch.searchrequest.searchcountrycode) IN ('SE', 'US')
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber) x
WHERE incoming_calls_7d > 0 OR sms_7d > 0
SORT BY countrycode) a
LEFT JOIN
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_3d,
sms_3d
FROM (
SELECT
appsearch.searchrequest.searchcountrycode AS countrycode,
appsearch.searchrequest.normalizedphonenumber AS incoming_phonenr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls_3d,
SUM(if(appsearch.type = 5, 1, 0)) sms_3d
FROM data.app_events
WHERE dt >= 20160101 AND dt <= 20160103


AND upper(appsearch.searchrequest.searchcountrycode) IN ('SE', 'US')
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber) x
WHERE incoming_calls_3d > 0 OR sms_3d > 0
SORT BY countrycode) b
ON a.countrycode = b.countrycode
AND a.incoming_phonenr = b.incoming_phonenr
LEFT JOIN
(SELECT
countrycode,
incoming_phonenr,
incoming_calls_24h,
sms_24h
FROM (
SELECT
appsearch.searchrequest.searchcountrycode AS countrycode,
appsearch.searchrequest.normalizedphonenumber AS incoming_phonenr,
SUM(if(appsearch.type = 2, 1, 0)) incoming_calls_24h,
SUM(if(appsearch.type = 5, 1, 0)) sms_24h
FROM data.app_events
WHERE dt = 20160101
AND upper(appsearch.searchrequest.searchcountrycode) IN ('SE', 'US')
AND appsearch.searchrequest.normalizedphonenumber IS NOT NULL
GROUP BY
appsearch.searchrequest.searchcountrycode,
appsearch.searchrequest.normalizedphonenumber) x
WHERE incoming_calls_24h > 0 OR sms_24h > 0
SORT BY countrycode) c


ON b.countrycode = c.countrycode
AND b.incoming_phonenr = c.incoming_phonenr
SORT BY countrycode;

13. Create table out of 3 joins FROM a large table with multiple aggregates AND GROUP BY:

CREATE TABLE sahir.nr_search_events_30days_v2 AS
SELECT
a.dt,
a.country_code,
COALESCE(b.hits, 0L) AS total_result_hits,
COALESCE(a.hits, 0L) AS `1_result_hits`,
COALESCE(c.hits, 0L) AS `0_result_hits`
FROM
(SELECT
dt,
LOWER(user.country_code) country_code,
COUNT(*) hits
FROM data.app_events
WHERE dt BETWEEN 20160201 AND 20160301
AND SIZE(search.sources) > 0
AND LOWER(search.sources[0]) LIKE '%ugc%'
AND search.result_size = 1
GROUP BY
dt,
LOWER(user.country_code)) a
LEFT JOIN
(SELECT
dt,


LOWER(user.country_code) country_code,
COUNT(*) hits
FROM data.app_events
WHERE dt BETWEEN 20160201 AND 20160301
AND SIZE(search.sources) > 0
AND search.result_size > 0
GROUP BY
dt,
LOWER(user.country_code)) b
ON a.dt = b.dt
AND LOWER(a.country_code) = LOWER(b.country_code)
LEFT JOIN
(SELECT
dt,
LOWER(user.country_code) country_code,
COUNT(*) hits
FROM data.app_events
WHERE dt BETWEEN 20160201 AND 20160301
AND SIZE(search.sources) > 0
AND search.result_size = 0
GROUP BY
dt,
LOWER(user.country_code)) c
ON b.dt = c.dt
AND LOWER(b.country_code) = LOWER(c.country_code)