Apache Hive Essentials
Table of Contents
Apache Hive Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Overview of Big Data and Hive
A short history
Introducing big data
Relational and NoSQL database versus Hadoop
Batch, real-time, and stream processing
Overview of the Hadoop ecosystem
Hive overview
Summary
2. Setting Up the Hive Environment
Installing Hive from Apache
Installing Hive from vendor packages
Starting Hive in the cloud
Using the Hive command line and Beeline
The Hive-integrated development environment
Summary
3. Data Definition and Description
Understanding Hive data types
Data type conversions
Hive Data Definition Language
Hive database
Hive internal and external tables
Hive partitions
Hive buckets
Hive views
Summary
4. Data Selection and Scope
The SELECT statement
The INNER JOIN statement
The OUTER JOIN and CROSS JOIN statements
Special JOIN – MAPJOIN
Set operation – UNION ALL
Summary
5. Data Manipulation
Data exchange – LOAD
Data exchange – INSERT
Data exchange – EXPORT and IMPORT
ORDER and SORT
Operators and functions
Transactions
Summary
6. Data Aggregation and Sampling
Basic aggregation – GROUP BY
Advanced aggregation – GROUPING SETS
Advanced aggregation – ROLLUP and CUBE
Aggregation condition – HAVING
Analytic functions
Sampling
Summary
7. Performance Considerations
Performance utilities
The EXPLAIN statement
The ANALYZE statement
Design optimization
Partition tables
Bucket tables
Index
Data file optimization
File format
Compression
Storage optimization
Job and query optimization
Local mode
JVM reuse
Parallel execution
Join optimization
Common join
Map join
Bucket map join
Sort merge bucket (SMB) join
Sort merge bucket map (SMBM) join
Skew join
Summary
8. Extensibility Considerations
User-defined functions
The UDF code template
The UDAF code template
The UDTF code template
Development and deployment
Streaming
SerDe
Summary
9. Security Considerations
Authentication
Metastore server authentication
HiveServer2 authentication
Authorization
Legacy mode
Storage-based mode
SQL standard-based mode
Encryption
Summary
10. Working with Other Tools
JDBC/ODBC connector
HBase
Hue
HCatalog
ZooKeeper
Oozie
Hive roadmap
Summary
Index
Apache Hive Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015
Production reference: 1210215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-857-5
www.packtpub.com
Credits

Author
Dayong Du

Reviewers
Puneetha B M
Hamzeh Khazaei
Nitin Pradeep Kumar
Balaswamy Vaddeman

Commissioning Editor
Ashwin Nair

Acquisition Editor
Shaon Basu

Content Development Editor
Merwyn D'souza

Technical Editor
Taabish Khan

Copy Editors
Sameen Siddiqui
Laxmi Subramanian

Project Coordinator
Neha Bhatnagar

Proofreaders
Paul Hindle
Jonathan Todd

Indexer
Monica Ajmera Mehta

Production Coordinator
Aparna Bhagat

Cover Work
Aparna Bhagat
About the Author

Dayong Du is a big data practitioner, leader, and developer with expertise in technology consulting, designing, and implementing enterprise big data solutions. With more than 10 years of experience in enterprise data warehouse, business intelligence, and big data and analytics, he has provided his data intelligence expertise in various industries, such as media, travel, telecommunications, and so on. He is currently working with QuickPlay Media in Toronto, Canada, to build enterprise big data intelligence reporting for online media services and content providers. He has a master's degree in computer science from Dalhousie University, and he holds the Cloudera Certified Developer for Apache Hadoop certification.

I would like to sincerely thank my wife, Joice, and daughter, Elaine, for their sacrifices and encouragement during this journey. Also, I would like to thank my parents for their support during the time of writing this book.

I would also like to thank everyone at Packt Publishing and the technical reviewers for their valuable help, guidance, and feedback on my book.
About the Reviewers

Puneetha B M is a software engineer, data enthusiast, and technical blogger. Her research interests include big data, cloud computing, machine learning, and NoSQL databases. She is also a professional software engineer with more than 2 years of working experience. She holds a master's degree in computer applications from P.E.S. Institute of Technology. Other than programming, she enjoys painting and listening to music. You can learn more from her blog (http://blog.puneethabm.in/) and LinkedIn profile (https://www.linkedin.com/in/puneethabm).

I owe a great deal to Prof. Dr. Ram Rustagi for being a role model in my life and for his zealous inspiration. I would like to thank my brother, Nischith B. M., for supporting me in everything I do. I would also like to thank Packt Publishing and its staff for providing the opportunity to contribute to this book.

Hamzeh Khazaei is a postdoctoral research scientist at IBM Canada Research and Development Centre. He received his PhD degree in computer science from University of Manitoba, Winnipeg, Manitoba, Canada (2009–2012). Earlier, he received both his BSc and MSc degrees in computer science from Amirkabir University of Technology, Tehran, Iran (2000–2008). He is also a sessional instructor in the Computer Science department at Ryerson University (http://scs.ryerson.ca/~hkhazaei). He teaches software engineering to fourth year undergraduate students. His research area includes big data analytics, cloud computing infrastructure, analytics as a service, and modeling of computing systems.

I would like to thank my dear wife for her perpetual support in all my endeavors.

Nitin Pradeep Kumar is a passionate developer with extensive experience and oodles of interest in emerging technologies such as the cloud and mobile. He is currently a cloud quality engineer at Appcelerator, a leading Silicon Valley-based start-up that provides an MBaaS platform purpose-built for mobile and cloud development. Before this stint, he studied at the National University of Singapore toward a master's degree in knowledge engineering, which involves building intelligent systems using cutting-edge artificial intelligence and data-mining techniques. He enjoys the start-up environment and has worked with technologies such as Hadoop, Hive, and data warehousing. He lives in Singapore and spends his spare cycles playing retro PC games on his mobile and learning Muay Thai.

I would like to thank my family, friends, and my wonderful brother, Nivin, for supporting me in all my endeavors.

Balaswamy Vaddeman is a Hadoop hackathon winner for Hyderabad in 2013. He is one of the top contributors on the Hive tag at http://www.stackoverflow.com. He is a big data professional with 3 years of experience. He is well known for training people on big data/Hadoop. So far, he has delivered six big data projects. He is a Java/J2EE expert with 8 years of IT experience and 5 years of RDBMS experience. He is an automation expert on Unix-based systems using Shell scripting. He has experience in setting up teams and bringing them up to speed on big data projects. He is an active participant in Hadoop/big data forums.

I would like to thank my wife, Radha, my son, Pandu, and my daughter, Bubly, for their cooperation in completing this book.
Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

I dedicate this book to my daughter
Preface

With an increasing interest in big data analysis, Hive over Hadoop has become a cutting-edge data solution for storing, computing, and analyzing big data. Its SQL-like syntax makes Hive easy to learn and widely accepted as a standard for interactive SQL queries over big data. The variety of features available within Hive provides the capability of doing complex big data analysis without advanced coding skills. The maturity of Hive lets it gradually merge and share its valuable architecture and functionalities across different computing frameworks beyond Hadoop.

Apache Hive Essentials prepares your journey to big data by covering the background and concepts of the big data domain, along with the process of setting up and getting familiar with your Hive working environment, in the first two chapters. In the next four chapters, the book guides you through discovering and transforming the value behind big data using examples of the Hive query language. In the last four chapters, the book highlights well-selected and advanced topics, such as performance, security, and extensions, as exciting adventures for this worthwhile big data journey.
What this book covers

Chapter 1, Overview of Big Data and Hive, introduces the evolution of big data, the Hadoop ecosystem, and Hive. You will also learn the Hive architecture and the advantages of using Hive in big data analysis.

Chapter 2, Setting Up the Hive Environment, describes the Hive environment setup and configuration. It also covers using Hive through the command line and development tools.

Chapter 3, Data Definition and Description, introduces the basic data types and data definition language for tables, partitions, buckets, and views in Hive.

Chapter 4, Data Selection and Scope, shows you ways to discover the data by querying, linking, and scoping the data in Hive.

Chapter 5, Data Manipulation, describes the process of exchanging, moving, sorting, and transforming the data in Hive.

Chapter 6, Data Aggregation and Sampling, explains how to do aggregation and sampling using aggregation functions, analytic functions, windowing, and sample clauses.

Chapter 7, Performance Considerations, introduces the best practices of performance considerations in the aspects of design, file format, compression, storage, query, and job.

Chapter 8, Extensibility Considerations, describes how to extend Hive by creating user-defined functions, streaming, serializers, and deserializers.

Chapter 9, Security Considerations, introduces the area of Hive security in terms of authentication, authorization, and encryption.

Chapter 10, Working with Other Tools, discusses how Hive works with other big data tools. It also reviews the key milestones of Hive releases.
What you need for this book

You will need to install both Hadoop and Hive to run the examples in this book. The scripts in this book were written and tested with Cloudera Distributed Hadoop (CDH) v5.3 (contains Hive v0.13.x and Hadoop v2.5.0), Hortonworks Data Platform (HDP) v2.2 (contains Hive v0.14.0 and Hadoop v2.6.0), and Apache Hive 1.0.0 (with Hadoop 1.2.1) in pseudo-distributed mode. However, the majority of the scripts will also run on previous versions of Hadoop and Hive. The following are the other software applications you may need for a better understanding of the Hive-related tools mentioned in the book. These tools are also available in the CDH or HDP packages.

Hue 2.2.0 and above
HBase 0.98.4
Oozie 4.0.0 and above
ZooKeeper 3.4.5
Tez 0.6.0
Who this book is for

If you are a data analyst, developer, or user who wants to use Hive to explore and analyze data in Hadoop, this is the book for you. Whether you are new to big data or an expert, you will be able to master both the basic and the advanced features of Hive. Since Hive is an SQL-like language, some previous experience with the SQL language and databases is useful for a better understanding of this book.
Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Aggregate function can be used with other aggregate functions in the same select statement."

A block of code is set as follows:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

customAuthenticator.java

package com.packtpub.hive.essentials.hiveudf;

import java.util.Hashtable;
import javax.security.sasl.AuthenticationException;
import org.apache.hive.service.auth.PasswdAuthenticationProvider;

Any command-line input or output is written as follows:

bash-4.1$ hdfs dfs -mkdir /tmp

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click on the OK button and restart Oracle SQL Developer."
Note: Warnings or important notes appear in a box like this.

Tip: Tips and tricks appear like this.
Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
Chapter 1. Overview of Big Data and Hive

This chapter is an overview of big data and Hive, especially in the Hadoop ecosystem. It briefly introduces the evolution of big data so that readers know where they are in the journey of big data and can find their preferred areas for future learning. This chapter also covers how Hive has become one of the leading tools in big data warehousing and why Hive is still competitive.

In this chapter, we will cover the following topics:

A short history from database and data warehouse to big data
Introducing big data
Relational and NoSQL databases versus Hadoop
Batch, real-time, and stream processing
Hadoop ecosystem overview
Hive overview
A short history

In the 1960s, when computers became a more cost-effective option for businesses, people started to use databases to manage data. Later on, in the 1970s, relational databases became more popular for business needs since they connected physical data to the logical business easily and closely. In the next decade, around the 1980s, Structured Query Language (SQL) became the standard query language for databases. The effectiveness and simplicity of SQL motivated lots of people to use databases and brought databases closer to a wide range of users and developers. Soon, it was observed that people used databases for data application and management, and this continued for a long period of time.

Once plenty of data was collected, people started to think about how to deal with the old data. Then, the term data warehousing came up in the 1990s. From that time onwards, people started to discuss how to evaluate current performance by reviewing historical data. Various data models and tools were created at that time to help enterprises effectively manage, transform, and analyze historical data. Traditional relational databases also evolved to provide more advanced aggregation and analytical functions as well as optimizations for data warehousing. The leading query language was still SQL, but it was more intuitive and powerful as compared to the previous versions. The data was still well structured and the model was normalized. As we entered the 2000s, the Internet gradually became the topmost industry for the creation of the majority of data in terms of variety and volume. Newer technologies, such as social media analytics, web mining, and data visualizations, helped lots of businesses and companies deal with massive amounts of data for a better understanding of their customers, products, competition, as well as markets. The data volume grew and the data format changed faster than ever before, which forced people to search for new solutions, especially from the academic and open source areas. As a result, big data became a hot topic and a challenging field for many researchers and companies.

However, in every challenge there lies great opportunity. Hadoop was one of the open source projects earning wide attention due to its open source license and active communities. This was one of the few times that an open source project led to changes in technology trends before any commercial software products. Soon after, the NoSQL database and real-time and stream computing, as followers, quickly became important components of big data ecosystems. Armed with these big data technologies, companies were able to review the past, evaluate the current, and also predict the future. Around the 2010s, time to market became the key factor for making businesses competitive and successful. When it comes to big data analysis, people could not wait to see the reports or results. A short delay could make a great difference when making important business decisions. Decision makers wanted to see the reports or results immediately, within a few hours, minutes, or even possibly seconds in a few cases. Real-time analytical tools, such as Impala (http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html), Presto (http://prestodb.io/), Storm (https://storm.apache.org/), and so on, make this possible in different ways.
Introducing big data

Big data is not simply a big volume of data. Here, the word "Big" refers to the big scope of data. A well-known saying in this domain is to describe big data with the help of three words starting with the letter V: volume, velocity, and variety. But the analytical and data science world has seen data varying in other dimensions in addition to the fundamental 3Vs of big data, such as veracity, variability, volatility, visualization, and value. The different Vs mentioned so far are explained as follows:

Volume: This refers to the amount of data generated in seconds. 90 percent of the world's data today has been created in the last two years. Since that time, the data in the world doubles every two years. Such big volumes of data are mainly generated by machines, networks, social media, and sensors, including structured, semi-structured, and unstructured data.

Velocity: This refers to the speed at which the data is generated, stored, analyzed, and moved around. With the availability of Internet-connected devices, wireless or wired, machines and sensors can pass on their data as soon as it is created. This leads to real-time streaming and helps businesses make valuable and fast decisions.

Variety: This refers to the different data formats. Data used to be stored as text, dat, and csv from sources such as file systems, spreadsheets, and databases. This type of data, which resides in a fixed field within a record or file, is called structured data. Nowadays, data is not always in the traditional format. The newer semi-structured or unstructured forms of data can be generated using various methods such as e-mails, photos, audio, video, PDFs, SMSes, or even something we have no idea about. These varieties of data formats create problems for storing and analyzing data. This is one of the major challenges we need to overcome in the big data domain.

Veracity: This refers to the quality of data, such as trustworthiness, biases, noise, and abnormality in data. Corrupt data is quite normal. It could originate due to a number of reasons, such as typos, missing or uncommon abbreviations, data reprocessing, system failures, and so on. However, ignoring this malicious data could lead to inaccurate data analysis and eventually a wrong decision. Therefore, making sure the data is correct in terms of data audition and correction is very important for big data analysis.

Variability: This refers to the changing of data. It means that the same data could have different meanings in different contexts. This is particularly important when carrying out sentiment analysis. The analysis algorithms should be able to understand the context and discover the exact meaning and values of data in that context.

Volatility: This refers to how long the data is valid and stored. This is particularly important for real-time analysis. It requires a target scope of data to be determined so that analysts can focus on particular questions and gain good performance out of the analysis.

Visualization: This refers to the way of making data well understood. Visualization does not mean ordinary graphs or pie charts. It makes vast amounts of data comprehensible in a multidimensional view that is easy to understand. Visualization is an innovative way to show changes in data. It requires lots of interaction, conversations, and joint efforts between big data analysts and business domain experts to make the visualization meaningful.

Value: This refers to the knowledge gained from data analysis on big data. The value of big data is how organizations turn themselves into big data-driven companies and use the insight from big data analysis for their decision making.

In summary, big data is not just about lots of data; it is a practice to discover new insight from existing data and guide the analysis of future data. A big-data-driven business will be more agile and competitive to overcome challenges and win competitions.
Relational and NoSQL database versus Hadoop

Let's compare different data solutions with ways of traveling. You will be surprised to find that they have many similarities. When people travel, they either take cars or airplanes depending on the travel distance and cost. For example, when you travel to Vancouver from Toronto, an airplane is always the first choice in terms of the travel time versus cost. When you travel to Niagara Falls from Toronto, a car is always a good choice. When you travel to Montreal from Toronto, some people may prefer taking a car to an airplane. The distance and cost here are like the big data volume and investment. The traditional relational database is like the car in this example, and the Hadoop big data tool is like the airplane. When you deal with a small amount of data (short distance), a relational database (like the car) is always the best choice since it is faster and more agile for a small or moderate size of data. When you deal with a big amount of data (long distance), Hadoop (like the airplane) is the best choice since it scales linearly and is fast and stable when dealing with big volumes of data. On the contrary, you can drive from Toronto to Vancouver, but it takes too much time. You can also take an airplane from Toronto to Niagara, but it could take more time and cost way more than if you travel by car. In addition, you may have a choice to either take a ship or a train. This is like a NoSQL database, which offers characteristics of both a relational database and Hadoop in terms of good performance and support for various data formats for big data.
Batch, real-time, and stream processing

Batch processing is used to process data in batches; it reads data input, processes it, and writes it to the output. Apache Hadoop is the most well-known and popular open source implementation of batch processing and a distributed system using the MapReduce paradigm. The data is stored in a shared and distributed file system called Hadoop Distributed File System (HDFS), divided into splits, which are the logical data divisions for MapReduce processing. To process these splits using the MapReduce paradigm, the map task reads the splits, passes all of its key/value pairs to a map function, and writes the results to intermediate files. After the map phase is completed, the reducer reads the intermediate files and passes them to the reduce function. Finally, the reduce task writes results to the final output files. The advantages of the MapReduce model include making distributed programming easier, near-linear speedup, good scalability, as well as fault tolerance. The disadvantage of this batch processing model is being unable to execute recursive or iterative jobs. In addition, the obvious batch behavior is that all inputs must be ready for map before the reduce job starts, which makes MapReduce unsuitable for online and stream processing use cases.

Real-time processing is to process data and get the result almost immediately. This concept in the area of real-time ad hoc queries over big data was first implemented in Dremel by Google. It uses a novel columnar storage format for nested structures, with fast indexing and scalable aggregation algorithms for computing query results in parallel instead of in batch sequences. These two techniques are the major characteristics of real-time processing and are used by similar implementations, such as Cloudera Impala, Facebook Presto, Apache Drill, and Hive on Tez powered by the Stinger initiative, whose goal is a 100x performance improvement over Apache Hive. On the other hand, in-memory computing no doubt offers other solutions for real-time processing. In-memory computing offers very high bandwidth, more than 10 gigabytes/second, compared to hard disks' 200 megabytes/second. Also, the latency is comparatively lower, nanoseconds versus milliseconds, compared to hard disks. With the price of RAM going lower and lower each day, in-memory computing is more affordable as a real-time solution; Apache Spark is a popular open source implementation of in-memory computing. Spark can be easily integrated with Hadoop, and its resilient distributed dataset can be generated from data sources such as HDFS and HBase for efficient caching.

Stream processing is to continuously process and act on live stream data to get a result. In stream processing, there are two popular frameworks: Storm (https://storm.apache.org/) from Twitter and S4 (http://incubator.apache.org/s4/) from Yahoo!. Both frameworks run on the Java Virtual Machine (JVM) and both process keyed streams. In terms of the programming model, an S4 program is defined as a graph of Processing Elements (PE), small subprograms, and S4 instantiates a PE per key. In short, Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework.
Overview of the Hadoop ecosystem

Hadoop was first released by Apache in 2011 as version 1.0.0. It only contained HDFS and MapReduce. Hadoop was designed as both a computing (MapReduce) and storage (HDFS) platform from the very beginning. With the increasing need for big data analysis, Hadoop attracted lots of other software to resolve big data questions together and merged into a Hadoop-centric big data ecosystem. The following diagram gives a brief introduction to the Hadoop ecosystem and the core software or components in the ecosystem:

The Hadoop ecosystem

In the current Hadoop ecosystem, HDFS is still the major storage option. On top of it, Snappy, RCFile, Parquet, and ORCFile could be used for storage optimization. Core Hadoop MapReduce released version 2.0, called Yarn, for better performance and scalability. Spark and Tez, as solutions for real-time processing, are able to run on Yarn to work with Hadoop closely. HBase is a leading NoSQL database, especially when there is a NoSQL database request on the deployed Hadoop clusters. Sqoop is still one of the leading and mature tools for exchanging data between Hadoop and relational databases. Flume is a mature, distributed, and reliable log-collecting tool to move or collect data to HDFS. Impala and Presto query directly against the data on HDFS for better performance. However, Hortonworks focuses on the Stinger initiative to make Hive 100 times faster. In addition, Hive over Spark and Hive over Tez offer a choice for users to run Hive on computing frameworks other than MapReduce. As a result, Hive is playing a more important role in the ecosystem than ever.
Hive overview

Hive is a standard for SQL queries over petabytes of data in Hadoop. It provides SQL-like access to data in HDFS, enabling Hadoop to be used like a data warehouse. The Hive Query Language (HQL) has similar semantics and functions to standard SQL in relational databases, so that experienced database analysts can easily get their hands on it. Hive's query language can run on different computing frameworks, such as MapReduce, Tez, and Spark, for better performance.

Hive's data model provides a high-level, table-like structure on top of HDFS. It supports three data structures: tables, partitions, and buckets, where tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets. Hive supports a majority of primitive data types such as TIMESTAMP, STRING, FLOAT, BOOLEAN, DECIMAL, DOUBLE, INT, SMALLINT, BIGINT, and complex data types, such as UNION, STRUCT, MAP, and ARRAY.
The following diagram shows the architecture of Hive in the Hadoop ecosystem. The Hive metadata store (also called the metastore) can use either embedded, local, or remote databases. Hive servers are built on Apache Thrift Server technology. Since Hive release 0.11, HiveServer2 is available to handle multiple concurrent clients; it supports Kerberos, LDAP, and custom pluggable authentication, providing better options for JDBC and ODBC clients, especially for metadata access.

Hive architecture
Here are some highlights of Hive that we can keep in mind moving forward:

Hive provides a simpler query model with less coding than MapReduce
HQL and SQL have similar syntax
Hive provides lots of functions that lead to easier analytics usage
The response time is typically much faster than other types of queries on the same type of huge datasets
Hive supports running on different computing frameworks
Hive supports ad hoc querying of data on HDFS
Hive supports user-defined functions, scripts, and a customized I/O format to extend its functionality
Hive is scalable and extensible to various types of data and bigger datasets
Mature JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting
Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats
Hive has a well-defined architecture for metadata management, authentication, and query optimizations
There is a big community of practitioners and developers working on and using Hive
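To illustrate the first of these highlights, an aggregation that would require a full MapReduce program in Java can be expressed in a few lines of HQL. The employee table and its columns below are hypothetical:

```sql
-- Count employees per year: one short query instead of hand-written
-- map and reduce classes compiled and packaged into a jar.
SELECT year, count(*) AS employee_cnt
FROM employee
GROUP BY year;
```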
Summary

After going through this chapter, we are now able to understand why and when to use big data instead of a traditional relational database. We also understand the difference between batch processing, real-time processing, and stream processing. We got familiar with the Hadoop ecosystem, especially Hive. We have also gone back in time and brushed through the history from database and data warehouse to big data, along with some big data terms, the Hadoop ecosystem, the Hive architecture, and the advantages of using Hive. In the next chapter, we will practice setting up Hive and all the tools needed to get started using Hive on the command line.
Chapter 2. Setting Up the Hive Environment

This chapter will introduce how to install and set up the Hive environment in the cluster and cloud. It also covers the usage of basic Hive commands and the Hive-integrated development environment.

In this chapter, we will cover the following topics:

Installing Hive from Apache
Installing Hive from vendor packages
Starting Hive in the cloud
Using the Hive command line and Beeline
The Hive-integrated development environment
Installing Hive from Apache

To introduce the Hive installation, we use Hive version 1.0.0 as an example. The pre-installation requirements for this installation are as follows:

JDK 1.7.0_51
Hadoop 0.20.x, 0.23.x.y, 1.x.y, or 2.x.y
Ubuntu 14.04/CentOS 6.2

Note: Since we focus on Hive in this book, the installation steps for Java and Hadoop are not provided here. For steps on installing them, please refer to https://www.java.com/en/download/help/download_options.xml and http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html.
The following steps describe how to install Hive from Apache through the Linux command line:
1. Download Hive from Apache Hive and unpack it:
bash-4.1$ wget http://apache.mirror.rafal.ca/hive/hive-1.0.0/apache-hive-1.0.0-bin.tar.gz
bash-4.1$ tar -zxvf apache-hive-1.0.0-bin.tar.gz
2. Add Hive to the system path by opening /etc/profile or ~/.bashrc and adding the following two rows:
export HIVE_HOME=/home/hivebooks/apache-hive-1.0.0-bin
export PATH=$PATH:$HIVE_HOME/bin:$HIVE_HOME/conf
3. Enable the settings immediately:
bash-4.1$ source /etc/profile
4. Create the configuration files:
bash-4.1$ cd apache-hive-1.0.0-bin/conf
bash-4.1$ cp hive-default.xml.template hive-site.xml
bash-4.1$ cp hive-env.sh.template hive-env.sh
bash-4.1$ cp hive-exec-log4j.properties.template hive-exec-log4j.properties
bash-4.1$ cp hive-log4j.properties.template hive-log4j.properties
5. Modify the configuration file at $HIVE_HOME/conf/hive-env.sh:
# Set HADOOP_HOME to point to a specific Hadoop install directory
export HADOOP_HOME=/home/hivebooks/hadoop-2.2.0
# Hive Configuration Directory can be accessed at:
export HIVE_CONF_DIR=/home/hivebooks/apache-hive-1.0.0-bin/conf
6. Modify the configuration file at $HIVE_HOME/conf/hive-site.xml. There are some important parameters that need special attention:
hive.metastore.warehouse.dir: This is the path for Hive warehouse storage. By default, it is /user/hive/warehouse.
hive.exec.scratchdir: This is the temporary data file path. By default, it is /tmp/hive-${user.name}.
By default, Hive uses the Derby (http://db.apache.org/derby/) database as the metadata store. Hive can also use other databases, such as PostgreSQL (http://www.postgresql.org/) or MySQL (http://www.mysql.com/), as the metadata store. To configure Hive to use other databases, the following parameters should be configured:
javax.jdo.option.ConnectionURL // the database URL
javax.jdo.option.ConnectionDriverName // the JDBC driver name
javax.jdo.option.ConnectionUserName // the database username
javax.jdo.option.ConnectionPassword // the database password
The following is an example setting using MySQL as the metastore database:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>
Make sure the MySQL JDBC driver is available at $HIVE_HOME/lib.
Note
The difference between an embedded Derby database and an external database is that an external database offers a shared service, so users can share the Hive metadata. An embedded database, however, is only visible to local users.
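For comparison with the MySQL setting shown above, this is what the default embedded Derby connection looks like in hive-site.xml. This is the stock value shipped in hive-default.xml.template and needs no extra configuration:

```xml
<!-- Default embedded Derby metastore; the metastore_db folder is
     created in the directory from which Hive is started -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
```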
Create folders and grant proper write permissions to the user group in the HDFS folder:
bash-4.1$ hdfs dfs -mkdir /tmp
bash-4.1$ hdfs dfs -mkdir /user/hive/warehouse
bash-4.1$ hdfs dfs -chmod g+w /tmp
bash-4.1$ hdfs dfs -chmod g+w /user/hive/warehouse
That’s all for the Apache Hive installation. On one of the nodes where Hive is installed, type hive to enter the Hive command-line environment (hive>), which verifies that Hive is successfully installed.
Installing Hive from vendor packages
Right now, many companies, such as Cloudera, MapR, IBM, and Hortonworks, have packaged Hadoop into more easily manageable distributions. Each company takes a slightly different strategy, but the consensus for all of these packages is to make Hadoop easier to use for the enterprise. For example, we can easily install Hive from Cloudera Distributed Hadoop (CDH), which can be downloaded from http://www.cloudera.com/content/cloudera/en/downloads/cdh.html.
Once CDH is installed to have the Hadoop environment ready, we can add Hive to the Hadoop cluster by following a few steps:
1. Log in to the Cloudera Manager and click on the dropdown button after the cluster name to choose Add a Service.
(Figure: Cloudera Manager main page)
2. In the first Add Service Wizard page, choose Hive to install.
3. In the second Add Service Wizard page, set the dependencies for the service. Sentry is the authorization policy service for Hive.
4. In the third Add Service Wizard page, choose the proper hosts for HiveServer2, Hive Metastore Server, WebHCat Server, and Gateway.
5. In the fourth Add Service Wizard page, configure the Hive Metastore Server database connections.
6. In the last page of the Add Service Wizard, review the changes to the Hive warehouse directory and metastore server port number. Keep the default values and click on the Continue button to start installing the Hive service. Once it is complete, close the wizard to finish the Hive installation.
Note
Hive can also be installed along with other services when we first install CDH in the cluster, or we can directly import the vendors' quick-start Hadoop virtual machine images.
Starting Hive in the cloud
Right now, Amazon EMR, Cloudera Director, and Microsoft Azure HDInsight Service are some of the major vendors offering matured Hadoop and Hive services in the cloud. Using the cloud version of Hive is very convenient; it requires almost no installation and setup.
Amazon EMR (http://aws.amazon.com/elasticmapreduce/) is the earliest Hadoop service in the cloud. However, it is not a pure open source version of Hadoop, but is customized to run only on the AWS cloud. Cloudera is one of the first few players that offered open source Hadoop solutions to the enterprise. Since the middle of October 2014, Cloudera has delivered Cloudera Director (http://www.cloudera.com/content/cloudera/en/products-and-services/director.html), which opens up Hadoop deployments in the cloud through a simple, self-service interface, and is fully supported on Amazon Web Services. Windows Azure HDInsight Service (http://azure.microsoft.com/en-us/documentation/services/hdinsight/) is a service that deploys and provisions Apache Hadoop clusters in the Azure cloud. Although Hadoop was first built on Linux, Hortonworks and Microsoft have partnered to bring the benefits of Apache Hadoop to the Windows Azure cloud.
The consensus among all the vendors here is to allow the enterprise to provision highly available Hadoop clusters powered with flexibility, security, management, and governance functionalities through a very simple user interface.
Using the Hive command line and Beeline
Hive first started with HiveServer1. However, this version of the Hive server was not very stable. It sometimes suspended or blocked clients' connections quietly. Since version 0.11.0, Hive has included a new Hive server called HiveServer2 as an addition to HiveServer1. HiveServer2 is an enhanced Hive server designed for multiclient concurrency and improved authentication. HiveServer2 also supports Beeline as the alternative command-line interface. HiveServer1 is deprecated and has been removed from Hive since version 1.0.0.
The primary difference between the two Hive servers is how the clients connect to Hive. Hive CLI is an Apache Thrift-based client, and Beeline is a JDBC client based on the SQLLine (http://sqlline.sourceforge.net/) CLI. The Hive CLI directly connects to the Hive drivers and requires installing Hive on the same machine as the client. However, Beeline connects to HiveServer2 through JDBC connections and does not require the installation of Hive libraries on the same machine as the client. That means we can run Beeline remotely from outside of the Hadoop cluster.
The following table lists the commonly used commands for both Beeline and Hive CLI. For more usage of HiveServer2 and Beeline, refer to https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients.
Purpose           | HiveServer2 Beeline                                | HiveServer1 CLI
Server connection | beeline -u <jdbc url> -n <username> -p <password>  | hive -h <hostname> -p <port>
Help              | beeline -h or beeline --help                       | hive -H
Run query         | beeline -e <query in quotes>                       | hive -e <query in quotes>
                  | beeline -f <query file name>                       | hive -f <query file name>
Define variable   | beeline --hivevar key=value                        | hive --hivevar key=value
                  | (available after Hive 0.13.0)                      |
The following is the command-line syntax in Beeline or Hive CLI:
Purpose            | HiveServer2 Beeline            | HiveServer1 CLI
Enter mode         | beeline                        | hive
Connect            | !connect <jdbc url>            | n/a
List tables        | !table                         | show tables;
List columns       | !column <table_name>           | desc <table_name>;
Run query          | <HQL query>;                   | <HQL query>;
Save result set    | !record <file_name>            | n/a
                   | !record (to stop recording)    |
Run shell CMD      | !sh ls                         | !ls;
                   | (available since Hive 0.14.0)  |
Run dfs CMD        | dfs -ls                        | dfs -ls;
Run file of SQL    | !run <file_name>               | source <file_name>;
Check Hive version | !dbinfo                        | !hive --version;
Quit mode          | !quit                          | quit;
Note
For Beeline, ; is not needed after commands that start with !.
When running a query in Hive CLI, MapReduce statistics information is shown on the console screen while processing, whereas Beeline does not show it.
Neither Beeline nor Hive CLI supports running a pasted query with <tab> inside it, because <tab> is used for auto-completion by default in the environment. Alternatively, running the query from files has no such issue.
Hive CLI shows the exact line and position of Hive query or syntax errors when the query has multiple lines. However, Beeline processes a multiple-line query as a single line, so only the position is shown for query or syntax errors, with the line number reported as 1 in all instances. In this respect, Hive CLI is more convenient than Beeline for debugging Hive queries.
In both Hive CLI and Beeline, the up and down arrow keys can retrieve up to 10,000 previous commands. The !history command can be used in Beeline to show all history.
Both Hive CLI and Beeline support variable substitution; refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution.
A list of Hive configuration settings and properties can be accessed and overwritten by the SET keyword from the command-line environment. For more details, refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties.
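As a quick illustration of the SET keyword, the following session-level commands list, check, and override a property. The property hive.exec.parallel is used here only as an example:

```sql
SET;                          -- list every configuration setting and its value
SET hive.exec.parallel;       -- show the current value of one property
SET hive.exec.parallel=true;  -- override the property for this session only
```

Overrides made this way apply only to the current session; edit hive-site.xml to change a property permanently.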
The Hive-integrated development environment
Besides the command-line interface, there are a few integrated development environment (IDE) tools available for Hive development. One of the best is Oracle SQL Developer, which leverages the powerful functionalities of the Oracle IDE and is totally free to use. If we have to use Oracle along with Hive in a project, it is quite convenient to switch between them within the same IDE.
Oracle SQL Developer has supported Hive since version 4.0.3. Configuring it to work with Hive is quite straightforward. The following are a few steps to configure the IDE to connect to Hive:
1. Download the Hive JDBC drivers from the vendor website, such as Cloudera.
2. Unzip the JDBC version 4 driver to a local directory.
3. Start Oracle SQL Developer and navigate to Preferences | Database | Third Party JDBC Drivers.
4. Add all of the JAR files contained in the unzipped directory to the Third-party JDBC Driver Path setting as follows:
(Figure: SQL Developer configuration)
5. Click on the OK button and restart Oracle SQL Developer.
6. Create new connections in the Hive tab, giving a proper Connection Name, Username, Password, Hostname (Hive server hostname), Port, and Database. Then, click on the Add and Connect buttons to connect to Hive.
(Figure: SQL Developer connections)
In Oracle SQL Developer, we can run all Hive interactive commands as well as Hive queries. We can also leverage the power of Oracle SQL Developer to browse and export data in a Hive table from the graphical user interface and wizards.
Besides Hive IDEs, Hive also has its own built-in web interface, the Hive Web Interface. However, it is not powerful and is not used very often. Hue (http://gethue.com/) is another web interface for the Hadoop ecosystem, including Hive. It is a very powerful and user-friendly web user interface. More details about using Hue with Hive are introduced in Chapter 10, Working with Other Tools.
Summary
In this chapter, we introduced the setup of Hive in different environments with proper settings. We also looked into a few of the Hive interactive commands and queries in Hive CLI, Beeline, and IDEs. After going through this chapter, we should be able to set up our own Hive environment locally and use Hive from CLI or IDE tools.
In the next chapter, we will dive into the details of Hive data definition languages.
Chapter 3. Data Definition and Description
This chapter introduces the basic data types, data definition language, and schema in Hive to describe data. It also covers best practices to describe data correctly and effectively by using internal or external tables, partitions, buckets, and views.
In this chapter, we will cover the following topics:
Hive primitive and complex data types
Data type conversions
Hive tables
Hive partitions
Hive buckets
Hive views
Understanding Hive data types
Hive data types are categorized into two types: primitive and complex data types. String and integer are the most useful primitive types, which are supported by most Hive functions.
Tip
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
The details of the primitive types are as follows:
Primitive data type | Description | Example
TINYINT | It has 1 byte, from -128 to 127. The postfix is Y. It is used as a small range of numbers. | 10Y
SMALLINT | It has 2 bytes, from -32,768 to 32,767. The postfix is S. It is used as a regular descriptive number. | 10S
INT | It has 4 bytes, from -2,147,483,648 to 2,147,483,647. | 10
BIGINT | It has 8 bytes, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The postfix is L. | 100L
FLOAT | This is a 4-byte single-precision floating point number, from 1.40129846432481707e-45 to 3.40282346638528860e+38 (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. | 1.2345679
DOUBLE | This is an 8-byte double-precision floating point number, from 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. | 1.2345678901234567
DECIMAL | This was introduced in Hive 0.11.0 with a hardcoded precision of 38 digits. Hive 0.13.0 introduced user-definable precision and scale. Its range is approximately -(10^38 - 1) to 10^38 - 1. Decimal data types store exact representations of numeric values. The default definition of this type is decimal(10,0). | DECIMAL(3,2) for 3.14
BINARY | This was introduced in Hive 0.8.0 and only supports CAST to STRING and vice versa. | 1011
BOOLEAN | This is a TRUE or FALSE value. | TRUE
STRING | This includes characters expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping within the strings. The max size is around 2 GB. | 'Books' or "Books"
CHAR | This is available starting with Hive 0.13.0. Most UDFs will work for this type after Hive 0.14.0. The maximum length is fixed at 255. | 'US' or "US"
VARCHAR | This is available starting with Hive 0.12.0. Most UDFs will work for this type after Hive 0.14.0. The maximum length is fixed at 65,535. If a string value being converted/assigned to a varchar value exceeds the length specified, the string is silently truncated. | 'Books' or "Books"
DATE | This describes a specific year, month, and day in the format YYYY-MM-DD. It is available since Hive 0.12.0. The range of dates is from 0000-01-01 to 9999-12-31. | '2013-01-01'
TIMESTAMP | This describes a specific year, month, day, hours, minutes, seconds, and milliseconds in the format YYYY-MM-DD HH:MM:SS[.fff...]. It is available since Hive 0.8.0. | '2013-01-01 12:00:01.345'
Hive has three main complex types: ARRAY, MAP, and STRUCT. These data types are built on top of the primitive data types. ARRAY and MAP are similar to those in Java. STRUCT is a record type, which may contain a set of fields of any type. Complex types allow the nesting of types. The details of complex types are as follows:
Complex data type | Description | Example
ARRAY | This is a list of items of the same type, such as (val1, val2, and so on). You can access a value using array_name[index], for example, fruit[0]='apple'. | ['apple', 'orange', 'mango']
MAP | This is a set of key-value pairs, such as (key1, val1, key2, val2, and so on). You can access a value using map_name[key], for example, fruit[1]="apple". | {1: "apple", 2: "orange"}
STRUCT | This is a user-defined structure of fields of any type, such as {val1, val2, val3, and so on}. By default, STRUCT field names will be col1, col2, and so on. You can access a value using struct_name.column_name, for example, fruit.col1=1. | {1, "apple"}
NAMED STRUCT | This is a user-defined structure of any number of typed fields, such as (name1, val1, name2, val2, and so on). You can access a value using struct_name.column_name, for example, fruit.apple="gala". | {"apple": "gala", "weightkg": 1}
UNION | This is a structure that has exactly one of the specified data types. It is available since Hive 0.7.0. It is not commonly used. | {2: ["apple", "orange"]}
Note
For MAP, the types of the keys and values must be unified. However, STRUCT is more flexible. STRUCT is more like a table, whereas MAP is more like an ARRAY with a customized index.
The following is a short practice covering all the commonly used Hive types. The details of the CREATE, LOAD, and SELECT statements will be described later. Let's take a look at the process:
1. Prepare the data as follows:
-bash-4.1$ vi employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
2. Log in to Beeline with the proper HiveServer2 hostname, port number, database name, username, and password:
-bash-4.1$ beeline
beeline> !connect jdbc:hive2://localhost:10000/default
scan complete in 20ms
Connecting to jdbc:hive2://localhost:10000/default
Enter username for jdbc:hive2://localhost:10000/default: dayongd
Enter password for jdbc:hive2://localhost:10000/default:
3. Create a table using the ARRAY, MAP, and STRUCT composite data types:
jdbc:hive2://> CREATE TABLE employee
.......> (
.......>   name string,
.......>   work_place ARRAY<string>,
.......>   sex_age STRUCT<sex:string,age:int>,
.......>   skills_score MAP<string,int>,
.......>   depart_title MAP<string,ARRAY<string>>
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.149 seconds)
4. Verify the table's creation:
jdbc:hive2://> !table employee
+------------+---------------+-------------+----------------+----------+
| TABLE_CAT  | TABLE_SCHEMA  | TABLE_NAME  | TABLE_TYPE     | REMARKS  |
+------------+---------------+-------------+----------------+----------+
|            | default       | employee    | MANAGED_TABLE  |          |
+------------+---------------+-------------+----------------+----------+
jdbc:hive2://> !column employee
+--------------+-------------+---------------+-----------------------------+
| TABLE_SCHEM  | TABLE_NAME  | COLUMN_NAME   | TYPE_NAME                   |
+--------------+-------------+---------------+-----------------------------+
| default      | employee    | name          | STRING                      |
| default      | employee    | work_place    | array<string>               |
| default      | employee    | sex_age       | struct<sex:string,age:int>  |
| default      | employee    | skills_score  | map<string,int>             |
| default      | employee    | depart_title  | map<string,array<string>>   |
+--------------+-------------+---------------+-----------------------------+
5. Load data into the table:
jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (1.023 seconds)
6. Query all the rows in the table:
jdbc:hive2://> SELECT * FROM employee;
+---------+---------------------+--------------+-------------------+-------------------------------+
| name    | work_place          | sex_age      | skills_score      | depart_title                  |
+---------+---------------------+--------------+-------------------+-------------------------------+
| Michael | [Montreal,Toronto]  | [Male,30]    | {DB=80}           | {Product=[Developer,Lead]}    |
| Will    | [Montreal]          | [Male,35]    | {Perl=85}         | {Test=[Lead],Product=[Lead]}  |
| Shelley | [New York]          | [Female,27]  | {Python=80}       | {Test=[Lead],COE=[Architect]} |
| Lucy    | [Vancouver]         | [Female,57]  | {Sales=89,HR=94}  | {Sales=[Lead]}                |
+---------+---------------------+--------------+-------------------+-------------------------------+
4 rows selected (0.677 seconds)
7. Query the whole array and each array column in the table:
jdbc:hive2://> SELECT work_place FROM employee;
+----------------------+
| work_place           |
+----------------------+
| [Montreal,Toronto]   |
| [Montreal]           |
| [New York]           |
| [Vancouver]          |
+----------------------+
4 rows selected (27.231 seconds)
jdbc:hive2://> SELECT work_place[0] AS col_1,
.......> work_place[1] AS col_2, work_place[2] AS col_3
.......> FROM employee;
+------------+----------+--------+
| col_1      | col_2    | col_3  |
+------------+----------+--------+
| Montreal   | Toronto  |        |
| Montreal   |          |        |
| New York   |          |        |
| Vancouver  |          |        |
+------------+----------+--------+
4 rows selected (24.689 seconds)
8. Query the whole struct and each struct column in the table:
jdbc:hive2://> SELECT sex_age FROM employee;
+---------------+
| sex_age       |
+---------------+
| [Male,30]     |
| [Male,35]     |
| [Female,27]   |
| [Female,57]   |
+---------------+
4 rows selected (28.91 seconds)
jdbc:hive2://> SELECT sex_age.sex, sex_age.age FROM employee;
+---------+------+
| sex     | age  |
+---------+------+
| Male    | 30   |
| Male    | 35   |
| Female  | 27   |
| Female  | 57   |
+---------+------+
4 rows selected (26.663 seconds)
9. Query the whole map and each map column in the table:
jdbc:hive2://> SELECT skills_score FROM employee;
+--------------------+
| skills_score       |
+--------------------+
| {DB=80}            |
| {Perl=85}          |
| {Python=80}        |
| {Sales=89,HR=94}   |
+--------------------+
4 rows selected (32.659 seconds)
jdbc:hive2://> SELECT name, skills_score['DB'] AS DB,
.......> skills_score['Perl'] AS Perl,
.......> skills_score['Python'] AS Python,
.......> skills_score['Sales'] AS Sales,
.......> skills_score['HR'] AS HR
.......> FROM employee;
+----------+-----+-------+---------+--------+-----+
| name     | db  | perl  | python  | sales  | hr  |
+----------+-----+-------+---------+--------+-----+
| Michael  | 80  |       |         |        |     |
| Will     |     | 85    |         |        |     |
| Shelley  |     |       | 80      |        |     |
| Lucy     |     |       |         | 89     | 94  |
+----------+-----+-------+---------+--------+-----+
4 rows selected (24.669 seconds)
Note
Note that the column names shown in the result set for Hive are always in lowercase letters.
10. Query the composite type in the table:
jdbc:hive2://> SELECT depart_title FROM employee;
+---------------------------------+
| depart_title                    |
+---------------------------------+
| {Product=[Developer,Lead]}      |
| {Test=[Lead],Product=[Lead]}    |
| {Test=[Lead],COE=[Architect]}   |
| {Sales=[Lead]}                  |
+---------------------------------+
4 rows selected (30.583 seconds)
jdbc:hive2://> SELECT name,
.......> depart_title['Product'] AS Product,
.......> depart_title['Test'] AS Test,
.......> depart_title['COE'] AS COE,
.......> depart_title['Sales'] AS Sales
.......> FROM employee;
+----------+--------------------+---------+--------------+---------+
| name     | product            | test    | coe          | sales   |
+----------+--------------------+---------+--------------+---------+
| Michael  | [Developer,Lead]   |         |              |         |
| Will     | [Lead]             | [Lead]  |              |         |
| Shelley  |                    | [Lead]  | [Architect]  |         |
| Lucy     |                    |         |              | [Lead]  |
+----------+--------------------+---------+--------------+---------+
4 rows selected (26.641 seconds)
jdbc:hive2://> SELECT name,
.......> depart_title['Product'][0] AS product_col0,
.......> depart_title['Test'][0] AS test_col0
.......> FROM employee;
+----------+---------------+------------+
| name     | product_col0  | test_col0  |
+----------+---------------+------------+
| Michael  | Developer     |            |
| Will     | Lead          | Lead       |
| Shelley  |               | Lead       |
| Lucy     |               |            |
+----------+---------------+------------+
4 rows selected (26.659 seconds)
Note
The default delimiters in Hive are as follows:
Field delimiter: This can be used with Ctrl + A or ^A (use \001 when creating the table)
Collection item delimiter: This can be used with Ctrl + B or ^B (\002)
Map key delimiter: This can be used with Ctrl + C or ^C (\003)
If a delimiter is overridden during the table creation, it only works when used in the flat structure. This is still a limitation in Hive, described in Apache JIRA HIVE-365 (https://issues.apache.org/jira/browse/HIVE-365).
For nested types, for example the depart_title column in the preceding tables, the level of nesting determines the delimiter. Using an ARRAY of ARRAY as an example, the delimiters for the outer ARRAY are Ctrl + B (\002) characters, as expected, but for the inner ARRAY they are Ctrl + C (\003) characters, the next delimiter in the list. For our example of using a MAP of ARRAY, the MAP key delimiter is \003, and the ARRAY delimiter is Ctrl + D or ^D (\004).
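To make the defaults explicit, a flat table can spell out the same delimiters with their octal codes in the DDL. This is a sketch for illustration; the table name employee_default is hypothetical, and for a flat schema this declaration is equivalent to omitting the delimiter clauses entirely:

```sql
-- Declaring the default delimiters explicitly (flat structure only)
CREATE TABLE employee_default (
  name string,
  work_place ARRAY<string>,
  skills_score MAP<string,int>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'            -- Ctrl + A
COLLECTION ITEMS TERMINATED BY '\002'  -- Ctrl + B
MAP KEYS TERMINATED BY '\003'          -- Ctrl + C
LINES TERMINATED BY '\n';
```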
Data type conversions
Similar to Java, Hive supports both implicit and explicit type conversion.
Primitive type conversion from a narrow to a wider type is known as implicit conversion. However, the reverse conversion is not allowed. All the integral numeric types, FLOAT, and STRING can be implicitly converted to DOUBLE, and TINYINT, SMALLINT, and INT can all be converted to FLOAT. BOOLEAN types cannot be converted to any other type. In the Apache Hive wiki, there is a data type cross table describing the allowed implicit conversions between every two types in Hive; it can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types.
Explicit type conversion uses the CAST function with the CAST(value AS TYPE) syntax. For example, CAST('100' AS INT) will convert the string '100' to the integer value 100. If the cast fails, such as in CAST('INT' AS INT), the function returns NULL. In addition, the BINARY type can only be cast to STRING; the result can then be cast from STRING to other types, if needed.
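The three cases just described (a successful cast, a failed cast returning NULL, and the two-step BINARY route) can be sketched as follows. The column bin_col and table some_table in the last statement are hypothetical:

```sql
SELECT CAST('100' AS INT);    -- returns 100: a valid numeric string
SELECT CAST('INT' AS INT);    -- returns NULL: the cast fails
-- BINARY cannot be cast to INT directly; go through STRING first
SELECT CAST(CAST(bin_col AS STRING) AS INT) FROM some_table;
```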
Hive Data Definition Language
Hive Data Definition Language (DDL) is a subset of Hive SQL statements that describe the data structure in Hive by creating, deleting, or altering schema objects such as databases, tables, views, partitions, and buckets. Most Hive DDL statements start with the keywords CREATE, DROP, or ALTER. The syntax of Hive DDL is very similar to the DDL in SQL. Comments in Hive start with --.
Hive database
The database in Hive describes a collection of tables that are used for a similar purpose or belong to the same groups. If the database is not specified, the default database is used. Whenever a new database is created, Hive creates a directory for it under the path defined in hive.metastore.warehouse.dir, which is /user/hive/warehouse by default. For example, the myhivebook database is located at /user/hive/warehouse/myhivebook.db. However, the default database doesn't have its own directory. The following is the core DDL for Hive databases:
Create the database without checking whether the database already exists:
jdbc:hive2://> CREATE DATABASE myhivebook;
Create the database and check whether the database already exists:
jdbc:hive2://> CREATE DATABASE IF NOT EXISTS myhivebook;
Create the database with location, comments, and metadata information:
jdbc:hive2://> CREATE DATABASE IF NOT EXISTS myhivebook
.......> COMMENT 'hive database demo'
.......> LOCATION '/hdfs/directory'
.......> WITH DBPROPERTIES ('creator'='dayongd','date'='2015-01-01');
Show and describe the database with wildcards:
jdbc:hive2://> SHOW DATABASES;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (1.7 seconds)
jdbc:hive2://> SHOW DATABASES LIKE 'my.*';
jdbc:hive2://> DESCRIBE DATABASE default;
+----------+------------------------+--------------------------------------------+
| db_name  | comment                | location                                   |
+----------+------------------------+--------------------------------------------+
| default  | Default Hive database  | hdfs://localhost:8020/user/hive/warehouse  |
+----------+------------------------+--------------------------------------------+
1 row selected (1.352 seconds)
Use the database:
jdbc:hive2://> USE myhivebook;
Drop the empty database:
jdbc:hive2://> DROP DATABASE IF EXISTS myhivebook;
Note
Note that Hive keeps the database and its tables as directories. In order to remove the parent directory, we need to remove the subdirectories first. By default, the database cannot be dropped if it is not empty, unless CASCADE is specified. CASCADE drops the tables in the database automatically before dropping the database.
Drop the database with CASCADE:
jdbc:hive2://> DROP DATABASE IF EXISTS myhivebook CASCADE;
Alter the database properties. The ALTER DATABASE statement can only apply to the database properties and the owner (user or role, Hive 0.13.0 and later) of the database. The other metadata about the database cannot be changed:
jdbc:hive2://> ALTER DATABASE myhivebook
.......> SET DBPROPERTIES ('edited-by'='Dayong');
jdbc:hive2://> ALTER DATABASE myhivebook
.......> SET OWNER user dayongd;
Note
SHOW and DESCRIBE
The SHOW and DESCRIBE keywords in Hive are used to show the definition information for most of the Hive objects, such as tables, partitions, and so on.
The SHOW statement supports a wide range of Hive objects, such as tables, table properties, table DDL, indexes, partitions, columns, functions, locks, roles, configurations, transactions, and compactions.
The DESCRIBE statement supports a smaller range of Hive objects, such as databases, tables, views, columns, and partitions. However, the DESCRIBE statement is able to provide more detailed information when combined with the EXTENDED or FORMATTED keywords.
In this book, there is no single section introducing SHOW and DESCRIBE; instead, we introduce their usage in line with other HQL throughout the remaining chapters.
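As a brief sketch of these keywords, the following statements reuse the employee table and myhivebook database from this chapter:

```sql
SHOW TABLES;                            -- list tables in the current database
SHOW CREATE TABLE employee;             -- print the DDL that recreates the table
DESCRIBE DATABASE EXTENDED myhivebook;  -- adds DBPROPERTIES to the output
DESCRIBE FORMATTED employee;            -- detailed, formatted table metadata
```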
Hive internal and external tables
The concept of a table in Hive is very similar to the table in a relational database. Each table associates with a directory, configured in ${HIVE_HOME}/conf/hive-site.xml, in HDFS. By default, it is /user/hive/warehouse in HDFS. For example, /user/hive/warehouse/employee is created by Hive in HDFS for the employee table. All the data in the table will be kept in the directory. Tables of this kind are also referred to as internal or managed tables.
When there is data already in HDFS, an external Hive table can be created to describe the data. It is called EXTERNAL because the data in the external table is specified by the LOCATION property instead of the default warehouse directory. When keeping data in internal tables, Hive fully manages the life cycle of the table and data. This means the data is removed once the internal table is dropped. If an external table is dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is preferred to avoid deleting data along with tables by mistake. The following are DDLs for Hive internal and external table examples:
Show the data file's location and content for the employee internal table:
bash-4.1$ vi /home/hadoop/employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Create the internal table and load the data:
jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_internal
.......> (
.......>   name string,
.......>   work_place ARRAY<string>,
.......>   sex_age STRUCT<sex:string,age:int>,
.......>   skills_score MAP<string,int>,
.......>   depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> COMMENT 'This is an internal table'
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':'
.......> STORED AS TEXTFILE;
No rows affected (0.149 seconds)
jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee_internal;
Create the external table and load the data:
jdbc:hive2://> CREATE EXTERNAL TABLE employee_external
.......> (
.......>   name string,
.......>   work_place ARRAY<string>,
.......>   sex_age STRUCT<sex:string,age:int>,
.......>   skills_score MAP<string,int>,
.......>   depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> COMMENT 'This is an external table'
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':'
.......> STORED AS TEXTFILE
.......> LOCATION '/user/dayongd/employee';
No rows affected (1.332 seconds)
jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee_external;
Note
CREATE TABLE
Hive tables do not yet have constraints like those in a relational database.
If the folder in the path given by the LOCATION property does not exist, Hive will create that folder. If there is another folder inside the folder specified by the LOCATION property, Hive will NOT report errors when creating the table, but will report an error when querying the table.
A temporary table, which is automatically deleted at the end of the Hive session, is supported since Hive 0.14.0 by HIVE-7090 (https://issues.apache.org/jira/browse/HIVE-7090) through the CREATE TEMPORARY TABLE statement.
The STORED AS property is set to TEXTFILE by default. Other file format values, such as SEQUENCEFILE, RCFILE, ORC, AVRO (since Hive 0.14.0), and PARQUET (since Hive 0.13.0), can also be specified.
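For instance, a similar schema could be kept in a columnar format simply by changing the STORED AS clause. This is a sketch; the table name employee_orc is hypothetical, and delimiter clauses are unnecessary for a binary format such as ORC:

```sql
CREATE TABLE employee_orc (
  name string,
  work_place ARRAY<string>,
  sex_age STRUCT<sex:string,age:int>
)
STORED AS ORC;  -- columnar, compressed storage instead of the default TEXTFILE
```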
Create the table as select (CTAS):
jdbc:hive2://> CREATE TABLE ctas_employee
.......> AS SELECT * FROM employee_external;
No rows affected (1.562 seconds)
Note
CTAS
CTAS copies the data as well as the table definition. The table created by CTAS is atomic; this means that other users do not see the table until all the query results are populated. CTAS has the following restrictions:
The table created cannot be a partitioned table
The table created cannot be an external table
The table created cannot be a list-bucketing table
A CTAS statement will trigger a map job to populate the data, even though the SELECT * statement itself does not trigger any MapReduce job.
A CTAS with a Common Table Expression (CTE) can be created as follows:
jdbc:hive2://> CREATE TABLE cte_employee AS
.......> WITH r1 AS
.......> (SELECT name FROM r2
.......> WHERE name = 'Michael'),
.......> r2 AS
.......> (SELECT name FROM employee
.......> WHERE sex_age.sex = 'Male'),
.......> r3 AS
.......> (SELECT name FROM employee
.......> WHERE sex_age.sex = 'Female')
.......> SELECT * FROM r1 UNION ALL SELECT * FROM r3;
No rows affected (61.852 seconds)
jdbc:hive2://> SELECT * FROM cte_employee;
+--------------------+
| cte_employee.name  |
+--------------------+
| Michael            |
| Shelley            |
| Lucy               |
+--------------------+
3 rows selected (0.091 seconds)
Note
CTE
CTE is available since Hive 0.13.0. It is a temporary result set derived from a simple SELECT query specified in a WITH clause, followed by a SELECT or INSERT keyword to operate on this result set. A CTE is defined only within the execution scope of a single statement. One or more CTEs can be used in a nested or chained way with Hive keywords, such as the SELECT, INSERT, CREATE TABLE AS SELECT, or CREATE VIEW AS SELECT statements.
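A minimal sketch of a CTE feeding an INSERT, one of the usages listed above. It assumes a pre-existing target table named male_employee with a single name column:

```sql
WITH males AS (
  SELECT name FROM employee WHERE sex_age.sex = 'Male'
)
INSERT OVERWRITE TABLE male_employee  -- hypothetical target table
SELECT name FROM males;
```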
Empty tables can be created in two ways, as follows:
1. Use CTAS, as shown here:
jdbc:hive2://> CREATE TABLE empty_ctas_employee AS
.......> SELECT * FROM employee_internal WHERE 1=2;
No rows affected (213.356 seconds)
2. Use LIKE, as shown here:
jdbc:hive2://> CREATE TABLE empty_like_employee
.......> LIKE employee_internal;
No rows affected (0.115 seconds)
Check the row counts for both tables:

jdbc:hive2://> SELECT COUNT(*) AS row_cnt
.......> FROM empty_ctas_employee;
+----------+
| row_cnt  |
+----------+
| 0        |
+----------+
1 row selected (51.228 seconds)

jdbc:hive2://> SELECT COUNT(*) AS row_cnt
.......> FROM empty_like_employee;
+----------+
| row_cnt  |
+----------+
| 0        |
+----------+
1 row selected (41.628 seconds)

Note: The LIKE way, which is faster, does not trigger a MapReduce job since it is metadata duplication only.
The DROP TABLE statement removes the metadata completely and moves the data to .Trash in the current user's directory in HDFS, if Trash is configured:

jdbc:hive2://> DROP TABLE IF EXISTS empty_ctas_employee;
No rows affected (0.283 seconds)

jdbc:hive2://> DROP TABLE IF EXISTS empty_like_employee;
No rows affected (0.202 seconds)
The TRUNCATE TABLE statement removes all the rows from a table, which should be an internal table:

jdbc:hive2://> SELECT * FROM cte_employee;
+---------------------+
| cte_employee.name   |
+---------------------+
| Michael             |
| Shelley             |
| Lucy                |
+---------------------+
3 rows selected (0.158 seconds)

jdbc:hive2://> TRUNCATE TABLE cte_employee;
No rows affected (0.093 seconds)

-- Table is empty after truncate
jdbc:hive2://> SELECT * FROM cte_employee;
+---------------------+
| cte_employee.name   |
+---------------------+
+---------------------+
No rows selected (0.059 seconds)
Alter the table's name with the RENAME TO statement:

jdbc:hive2://> !table
+--------------+--------------------+-------------+-----------------------------+
| TABLE_SCHEM  | TABLE_NAME         | TABLE_TYPE  | REMARKS                     |
+--------------+--------------------+-------------+-----------------------------+
| default      | employee           | TABLE       | NULL                        |
| default      | employee_internal  | TABLE       | This is an internal table   |
| default      | employee_external  | TABLE       | This is an external table   |
| default      | ctas_employee      | TABLE       | NULL                        |
| default      | cte_employee       | TABLE       | NULL                        |
+--------------+--------------------+-------------+-----------------------------+

jdbc:hive2://> ALTER TABLE cte_employee RENAME TO c_employee;
No rows affected (0.237 seconds)
Alter the table's properties, such as comments:

jdbc:hive2://> ALTER TABLE c_employee
.......> SET TBLPROPERTIES ('comment' = 'New name, comments');
No rows affected (0.239 seconds)

jdbc:hive2://> !table
+--------------+--------------------+-------------+-----------------------------+
| TABLE_SCHEM  | TABLE_NAME         | TABLE_TYPE  | REMARKS                     |
+--------------+--------------------+-------------+-----------------------------+
| default      | employee           | TABLE       | NULL                        |
| default      | employee_internal  | TABLE       | This is an internal table   |
| default      | employee_external  | TABLE       | This is an external table   |
| default      | ctas_employee      | TABLE       | NULL                        |
| default      | c_employee         | TABLE       | New name, comments          |
+--------------+--------------------+-------------+-----------------------------+

Alter the table's delimiter through SERDEPROPERTIES:

jdbc:hive2://> ALTER TABLE employee_internal SET
.......> SERDEPROPERTIES ('field.delim' = '$');
No rows affected (0.148 seconds)
Alter the table's file format:

jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT RCFILE;
No rows affected (0.235 seconds)

Alter the table's location, which must be a full URI of HDFS:

jdbc:hive2://> ALTER TABLE c_employee
.......> SET LOCATION
.......> 'hdfs://localhost:8020/user/dayongd/employee';
No rows affected (0.169 seconds)
Alter the table's protection by enabling or disabling NO_DROP, which prevents a table from being dropped, or OFFLINE, which prevents data (not metadata) in a table from being queried:

jdbc:hive2://> ALTER TABLE c_employee ENABLE NO_DROP;
jdbc:hive2://> ALTER TABLE c_employee DISABLE NO_DROP;
jdbc:hive2://> ALTER TABLE c_employee ENABLE OFFLINE;
jdbc:hive2://> ALTER TABLE c_employee DISABLE OFFLINE;
Alter the table's concatenation to merge small files into larger files:

-- Convert to the file format supported
jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT ORC;
No rows affected (0.160 seconds)

-- Concatenate files
jdbc:hive2://> ALTER TABLE c_employee CONCATENATE;
No rows affected (0.165 seconds)

-- Convert back to the regular file format
jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT TEXTFILE;
No rows affected (0.143 seconds)
Note CONCATENATE

In Hive release 0.8.0, RCFile added support for fast block-level merging of small RCFiles using the CONCATENATE command. In Hive release 0.14.0, ORC files added support for fast stripe-level merging of small ORC files using the CONCATENATE command. Other file formats are not supported yet. In the case of RCFiles, the merge happens at block level, and ORC files merge at stripe level, thereby avoiding the overhead of decompressing and decoding the data. A MapReduce job is triggered when performing concatenation.
Alter the column's name, data type, and position:

-- Check the column type before changes
jdbc:hive2://> DESC employee_internal;
+----------------+-----------------------------+----------+
| col_name       | data_type                   | comment  |
+----------------+-----------------------------+----------+
| employee_name  | string                      |          |
| work_place     | array<string>               |          |
| sex_age        | struct<sex:string,age:int>  |          |
| skills_score   | map<string,int>             |          |
| depart_title   | map<string,array<string>>   |          |
+----------------+-----------------------------+----------+
5 rows selected (0.119 seconds)

-- Change column type and order
jdbc:hive2://> ALTER TABLE employee_internal
.......> CHANGE name employee_name string AFTER sex_age;
No rows affected (0.23 seconds)

-- Verify the changes
jdbc:hive2://> DESC employee_internal;
+----------------+-----------------------------+----------+
| col_name       | data_type                   | comment  |
+----------------+-----------------------------+----------+
| work_place     | array<string>               |          |
| sex_age        | struct<sex:string,age:int>  |          |
| employee_name  | string                      |          |
| skills_score   | map<string,int>             |          |
| depart_title   | map<string,array<string>>   |          |
+----------------+-----------------------------+----------+
5 rows selected (0.214 seconds)
Alter the column's type and order:

jdbc:hive2://> ALTER TABLE employee_internal
.......> CHANGE employee_name name string FIRST;
No rows affected (0.238 seconds)

-- Verify the changes
jdbc:hive2://> DESC employee_internal;
+---------------+-----------------------------+----------+
| col_name      | data_type                   | comment  |
+---------------+-----------------------------+----------+
| name          | string                      |          |
| work_place    | array<string>               |          |
| sex_age       | struct<sex:string,age:int>  |          |
| skills_score  | map<string,int>             |          |
| depart_title  | map<string,array<string>>   |          |
+---------------+-----------------------------+----------+
5 rows selected (0.119 seconds)
Add/replace columns:

-- Add columns to the table
jdbc:hive2://> ALTER TABLE c_employee ADD COLUMNS (work string);
No rows affected (0.184 seconds)

-- Verify the added columns
jdbc:hive2://> DESC c_employee;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| name      | string     |          |
| work      | string     |          |
+-----------+------------+----------+
2 rows selected (0.115 seconds)

-- Replace all columns
jdbc:hive2://> ALTER TABLE c_employee
.......> REPLACE COLUMNS (name string);
No rows affected (0.132 seconds)

-- Verify that all columns were replaced
jdbc:hive2://> DESC c_employee;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| name      | string     |          |
+-----------+------------+----------+
1 row selected (0.129 seconds)

Note: The ALTER command will only modify Hive's metadata, NOT the data. Users should make sure the actual data conforms with the metadata definition manually.
Hive partitions

By default, a simple query in Hive scans the whole Hive table. This slows down the performance when querying a large table. The issue can be resolved by creating Hive partitions, which are very similar to those in an RDBMS. In Hive, each partition corresponds to a predefined partition column (or columns) and is stored as a subdirectory in the table's directory in HDFS. When the table gets queried, only the required partitions (directories) of data in the table are read, so the I/O and time taken by the query are greatly reduced. It is very easy to implement Hive partitions when the table is created and to check the partitions created, as follows:

-- Create partitions when creating tables
jdbc:hive2://> CREATE TABLE employee_partitioned
.......> (
.......> name string,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> PARTITIONED BY (Year INT, Month INT)
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.293 seconds)

-- Show partitions
jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+------------+
| partition  |
+------------+
+------------+
No rows selected (0.177 seconds)
From the preceding result, we can see that partitions are not created automatically. We have to use ALTER TABLE ADD PARTITION to add partitions to a table. The ADD PARTITION command changes the table's metadata, but does not load data. If the data does not exist in the partition's location, queries will not return any results. To drop a partition, including both data and metadata, use the ALTER TABLE DROP PARTITION statement, as follows:

-- Add multiple partitions
jdbc:hive2://> ALTER TABLE employee_partitioned ADD
.......> PARTITION (year=2014, month=11)
.......> PARTITION (year=2014, month=12);
No rows affected (0.248 seconds)

jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition           |
+---------------------+
| year=2014/month=11  |
| year=2014/month=12  |
+---------------------+
2 rows selected (0.108 seconds)

-- Drop the partition
jdbc:hive2://> ALTER TABLE employee_partitioned
.......> DROP IF EXISTS PARTITION (year=2014, month=11);

jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition           |
+---------------------+
| year=2014/month=12  |
+---------------------+
1 row selected (0.107 seconds)
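Since each partition is just a `<column>=<value>` subdirectory in HDFS, partition pruning is essentially directory filtering. The following plain-Python sketch (not Hive code; the warehouse paths are illustrative assumptions) models how a query that filters on partition columns ends up reading only the matching directories:

```python
# A minimal sketch of partition pruning: each partition is a
# <column>=<value> subdirectory, and a query filtering on partition
# columns only needs to scan the directories whose values match.
# All paths below are hypothetical.

partition_dirs = [
    "/user/hive/warehouse/employee_partitioned/year=2014/month=11",
    "/user/hive/warehouse/employee_partitioned/year=2014/month=12",
    "/user/hive/warehouse/employee_partitioned/year=2015/month=1",
]

def parse_partition(path):
    """Extract partition column values from the directory path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=")
            parts[key] = int(value)
    return parts

def prune(dirs, **predicates):
    """Keep only directories whose partition values match all predicates."""
    return [d for d in dirs
            if all(parse_partition(d).get(k) == v
                   for k, v in predicates.items())]

# WHERE year = 2014 AND month = 12 scans only one directory
print(prune(partition_dirs, year=2014, month=12))
```

This is why a WHERE clause on partition columns is so much cheaper than one on regular columns: the latter still has to scan every file.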
To avoid manually adding partitions, dynamic partition insert (or multipartition insert) is designed to dynamically determine which partitions should be created and populated while scanning the input table. This part is introduced in more detail in Chapter 5, Data Manipulation.

To load or overwrite data in a partition, we can use the LOAD or INSERT OVERWRITE statements. The statement only overwrites the data in the specified partitions. Although partition columns are subdirectory names, we can query or specify them in the SELECT or WHERE statements to narrow down the result set. The following steps show how to load data into the partitioned table:

Load data into the partition:

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee.txt'
.......> OVERWRITE INTO TABLE employee_partitioned
.......> PARTITION (year=2014, month=12);
No rows affected (0.96 seconds)

Verify the data that is loaded:

jdbc:hive2://> SELECT name, year, month FROM employee_partitioned;
+----------+-------+--------+
| name     | year  | month  |
+----------+-------+--------+
| Michael  | 2014  | 12     |
| Will     | 2014  | 12     |
| Shelley  | 2014  | 12     |
| Lucy     | 2014  | 12     |
+----------+-------+--------+
4 rows selected (37.451 seconds)
The ALTER TABLE/PARTITION statements for file format, location, protections, and concatenation have the same syntax as the ALTER TABLE statements and are shown here:

ALTER TABLE table_name PARTITION partition_spec SET FILEFORMAT file_format;
ALTER TABLE table_name PARTITION partition_spec SET LOCATION 'full URI';
ALTER TABLE table_name PARTITION partition_spec ENABLE NO_DROP;
ALTER TABLE table_name PARTITION partition_spec ENABLE OFFLINE;
ALTER TABLE table_name PARTITION partition_spec DISABLE NO_DROP;
ALTER TABLE table_name PARTITION partition_spec DISABLE OFFLINE;
ALTER TABLE table_name PARTITION partition_spec CONCATENATE;
Hive buckets

Besides partition, the bucket is another technique to cluster datasets into more manageable parts to optimize query performance. Different from a partition, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partition. If there is a further request to use employee_id as the third level of partition, it leads to many deep and small partitions and directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column will be hashed by a user-defined number into buckets. The records with the same employee_id will always be stored in the same bucket (segment of files). By using buckets, Hive can easily and efficiently do sampling (see Chapter 6, Data Aggregation and Sampling) and map-side joins (see Chapter 4, Data Selection and Scope). An example to create a bucket table is as follows:
-- Prepare another dataset and table for the bucket table
jdbc:hive2://> CREATE TABLE employee_id
.......> (
.......> name string,
.......> employee_id int,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<string,ARRAY<string>>
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.101 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee_id.txt'
.......> OVERWRITE INTO TABLE employee_id;
No rows affected (0.112 seconds)

-- Create the bucket table
jdbc:hive2://> CREATE TABLE employee_id_buckets
.......> (
.......> name string,
.......> employee_id int,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<string,ARRAY<string>>
.......> )
.......> CLUSTERED BY (employee_id) INTO 2 BUCKETS
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.104 seconds)
Note Bucket numbers

To define the proper number of buckets, we should avoid having too much or too little data in each bucket. A better choice is somewhere near two blocks of data per bucket. For example, we can plan 512 MB of data in each bucket if the Hadoop block size is 256 MB. If possible, use 2^N as the number of buckets.
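The sizing rule of thumb above can be turned into a small worked calculation. The following Python sketch (the helper name and sizes are illustrative assumptions, not Hive code) targets roughly two HDFS blocks per bucket and rounds up to a power of two:

```python
# Worked example of the bucket-sizing guidance: aim for about two
# HDFS blocks of data per bucket, then round up to a power of two.
# The function name and sizes are illustrative assumptions.

def suggest_buckets(total_data_bytes, block_size_bytes):
    target_bucket_size = 2 * block_size_bytes        # ~two blocks per bucket
    raw = max(1, total_data_bytes // target_bucket_size)
    n = 1
    while n < raw:                                   # round up to 2^N
        n *= 2
    return n

GB = 1024 ** 3
MB = 1024 ** 2

# 10 GB of data with a 256 MB block size -> 512 MB per bucket
# -> 20 buckets raw, rounded up to 32
print(suggest_buckets(10 * GB, 256 * MB))  # 32
```

Note that the bucket count is fixed in the table DDL (CLUSTERED BY ... INTO N BUCKETS), so this estimate should be done against the expected data volume before creating the table.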
Bucketing has a close dependency on the underlying data loaded. To properly load data into a bucket table, we need to either set the maximum number of reducers to the same number of buckets specified in the table creation (for example, 2), or enable enforced bucketing, as follows:

jdbc:hive2://> SET mapred.reduce.tasks = 2;
No rows affected (0.026 seconds)

jdbc:hive2://> SET hive.enforce.bucketing = true;
No rows affected (0.002 seconds)
To populate data into the bucket table, we cannot use the LOAD statement as was done with regular tables, since LOAD does not verify the data against the metadata. Instead, INSERT should be used to populate the bucket table, as follows:

jdbc:hive2://> INSERT OVERWRITE TABLE employee_id_buckets
.......> SELECT * FROM employee_id;
No rows affected (75.468 seconds)

-- Verify the buckets in the HDFS
-bash-4.1$ hdfs dfs -ls /user/hive/warehouse/employee_id_buckets
Found 2 items
-rwxrwxrwx   1 hive hive  900 2014-11-02 10:54 /user/hive/warehouse/employee_id_buckets/000000_0
-rwxrwxrwx   1 hive hive  582 2014-11-02 10:54 /user/hive/warehouse/employee_id_buckets/000001_0
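The two files listed are the two buckets. Conceptually, each row lands in the bucket file numbered hash(bucket column value) mod bucket count; for integer keys Hive's hash is effectively the value itself. The following Python sketch (sample ids are assumptions) models that assignment:

```python
# A sketch of how rows map to bucket files: the bucket column value is
# hashed modulo the bucket count, so rows sharing the same employee_id
# always land in the same bucket file (000000_0, 000001_0, ...).
# For int keys this mimics Hive, where the hash of n is n itself.

NUM_BUCKETS = 2

def bucket_of(employee_id):
    return employee_id % NUM_BUCKETS

rows = [("Michael", 100), ("Will", 101), ("Shelley", 102), ("Lucy", 103)]
buckets = {0: [], 1: []}
for name, emp_id in rows:
    buckets[bucket_of(emp_id)].append(name)

print(buckets[0])  # even ids -> file 000000_0
print(buckets[1])  # odd ids  -> file 000001_0
```

This deterministic placement is what lets Hive sample a single bucket, or join two bucketed tables bucket-by-bucket, without scanning everything.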
Hive views

In Hive, views are logical data structures that can be used to simplify queries by either hiding complexities, such as joins, subqueries, and filters, or by flattening the data. Unlike some RDBMS, Hive views do not store data or get materialized. Once the Hive view is created, its schema is frozen immediately. Subsequent changes to the underlying tables (for example, adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed, subsequent attempts to query the invalid view will fail, as follows:
jdbc:hive2://> CREATE VIEW employee_skills
.......> AS
.......> SELECT name, skills_score['DB'] AS DB,
.......> skills_score['Perl'] AS Perl,
.......> skills_score['Python'] AS Python,
.......> skills_score['Sales'] AS Sales,
.......> skills_score['HR'] AS HR
.......> FROM employee;
No rows affected (0.253 seconds)
When creating views, there is no MapReduce job triggered at all, since this is only a metadata change. However, a proper MapReduce job will be triggered when querying the view. Use SHOW CREATE TABLE or DESC FORMATTED on the view to display the CREATE VIEW statement that created it. The following are other Hive view DDLs:
Alter the view's properties:

jdbc:hive2://> ALTER VIEW employee_skills
.......> SET TBLPROPERTIES ('comment' = 'This is a view');
No rows affected (0.19 seconds)

Redefine the view:

jdbc:hive2://> ALTER VIEW employee_skills AS
.......> SELECT * FROM employee;
No rows affected (0.17 seconds)

Drop the view:

jdbc:hive2://> DROP VIEW employee_skills;
No rows affected (0.156 seconds)
Summary

After going through this chapter, we are able to define and use various data types in Hive. We should know how to create, alter, and drop tables, partitions, and views in Hive, and how to use external tables, internal tables, partitions, buckets, and views in Hive.

In the next chapter, we will dive into the details of querying data with Hive.
Chapter 4. Data Selection and Scope

This chapter is about how to discover the data by querying it, linking it, and limiting the data ranges or scopes. The chapter mainly covers the syntax and usage of Hive SELECT, WHERE, LIMIT, JOIN, and UNION ALL to operate on datasets.

In this chapter, we will cover the following topics:

The SELECT statement
The common JOIN statement
The special JOIN (MAPJOIN) statement
The set operation statement (UNION ALL)
The SELECT statement

The most common use case of Hive is to query data in Hadoop. To achieve this, we need to write and execute the SELECT statement in Hive. The typical work done by the SELECT statement is to project the rows meeting the query conditions specified in the WHERE clause after the target table, and return the result set. The SELECT statement is quite often used with the FROM, DISTINCT, WHERE, and LIMIT keywords. We will introduce them through examples as follows.

The SELECT * statement here means all the columns in the table are selected. By default, all rows are returned, including duplicated rows. If the DISTINCT keyword is used, only unique rows from the table are selected and returned. The LIMIT keyword is used to limit the number of rows returned. In addition, SELECT * scans the whole table/file without triggering MapReduce jobs, so it runs faster than SELECT <column_name>. Since Hive 0.10.0, simple SELECT statements, such as SELECT <column_name> FROM <table_name> LIMIT n, can also avoid triggering a MapReduce job if the Hive fetch task conversion is enabled by setting hive.fetch.task.conversion=more.
The following tasks can be done:

Query all or specific columns in the table:

jdbc:hive2://> SELECT * FROM employee;
+----------+----------------------+--------------+--------------------+---------------------------------+
| name     | work_place           | sex_age      | skills_score       | depart_title                    |
+----------+----------------------+--------------+--------------------+---------------------------------+
| Michael  | [Montreal, Toronto]  | [Male, 30]   | {DB=80}            | {Product=[Developer, Lead]}     |
| Will     | [Montreal]           | [Male, 35]   | {Perl=85}          | {Test=[Lead], Product=[Lead]}   |
| Shelley  | [New York]           | [Female, 27] | {Python=80}        | {Test=[Lead], COE=[Architect]}  |
| Lucy     | [Vancouver]          | [Female, 57] | {Sales=89, HR=94}  | {Sales=[Lead]}                  |
+----------+----------------------+--------------+--------------------+---------------------------------+
4 rows selected (0.677 seconds)

jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (162.452 seconds)
Select unique values of the specified column:

jdbc:hive2://> SELECT DISTINCT name FROM employee LIMIT 2;
+----------+
| name     |
+----------+
| Lucy     |
| Michael  |
+----------+
2 rows selected (71.125 seconds)

Enable the fetch task and verify the performance improvement:

jdbc:hive2://> SET hive.fetch.task.conversion=more;
No rows affected (0.002 seconds)

jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (0.242 seconds)

Besides LIMIT, WHERE is another generic condition clause to limit the returned result set. The WHERE condition can be any Boolean expression or user-defined function comparing table or partition columns:

jdbc:hive2://> SELECT name, work_place FROM employee
.......> WHERE name = 'Michael';
+----------+-------------------------+
| name     | work_place              |
+----------+-------------------------+
| Michael  | ["Montreal","Toronto"]  |
+----------+-------------------------+
1 row selected (38.107 seconds)
Multiple SELECT statements can work together to build a complex query using nested queries or subqueries, such as JOIN and UNION. The following are a few examples of using nested/subqueries. Subqueries can be used in the form of WITH (also referred to as CTE since Hive 0.13.0), or after the FROM or WHERE statements. When using subqueries, an alias should be given for the subquery (see t1 in the following example); otherwise, Hive will report exceptions. The different uses of SELECT statements are as follows:
A nested SELECT using CTE can be implemented as follows:

jdbc:hive2://> WITH t1 AS (
.......> SELECT * FROM employee
.......> WHERE sex_age.sex = 'Male')
.......> SELECT name, sex_age.sex AS sex FROM t1;
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (38.706 seconds)

A nested SELECT after the FROM statement can be implemented as follows:

jdbc:hive2://> SELECT name, sex_age.sex AS sex
.......> FROM
.......> (
.......> SELECT * FROM employee
.......> WHERE sex_age.sex = 'Male'
.......> ) t1;
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (48.198 seconds)
The Hive subquery in the WHERE clause can be used with IN, NOT IN, EXISTS, or NOT EXISTS, as follows. If the alias (see the following example for the employee table) is not specified before columns (name) in the WHERE condition, Hive will report the error "Correlating expression cannot contain unqualified column references". This is a limitation of the Hive subquery. A subquery that uses EXISTS or NOT EXISTS must refer to both inner and outer expressions. This is similar to the JOIN table, which is introduced later. This is not supported by the IN and NOT IN clauses.
jdbc:hive2://> SELECT name, sex_age.sex AS sex
.......> FROM employee a
.......> WHERE a.name IN
.......> (SELECT name FROM employee
.......> WHERE sex_age.sex = 'Male'
.......> );
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (54.644 seconds)

jdbc:hive2://> SELECT name, sex_age.sex AS sex
.......> FROM employee a
.......> WHERE EXISTS
.......> (SELECT * FROM employee b
.......> WHERE a.sex_age.sex = b.sex_age.sex
.......> AND b.sex_age.sex = 'Male'
.......> );
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (69.48 seconds)

There are additional restrictions for subqueries used in WHERE clauses:

Subqueries can only appear on the right-hand side of the WHERE clauses
Nested subqueries are not allowed
The IN and NOT IN statements support only one column
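The difference between the two patterns is that IN tests membership against an uncorrelated list, while EXISTS re-evaluates a correlated condition per outer row. The following plain-Python rendering of the two queries above (using the chapter's sample rows) makes that explicit:

```python
# A plain-Python rendering of the IN and EXISTS subquery patterns above,
# using the chapter's sample employee rows.

employees = [
    {"name": "Michael", "sex": "Male"},
    {"name": "Will", "sex": "Male"},
    {"name": "Shelley", "sex": "Female"},
    {"name": "Lucy", "sex": "Female"},
]

# WHERE a.name IN (SELECT name FROM employee WHERE sex_age.sex = 'Male')
# -- the inner query runs once, producing an uncorrelated value list
male_names = {e["name"] for e in employees if e["sex"] == "Male"}
in_result = [e["name"] for e in employees if e["name"] in male_names]

# WHERE EXISTS (SELECT * FROM employee b
#               WHERE a.sex_age.sex = b.sex_age.sex AND b.sex_age.sex = 'Male')
# -- the inner condition is correlated: it references the outer row a
exists_result = [a["name"] for a in employees
                 if any(a["sex"] == b["sex"] and b["sex"] == "Male"
                        for b in employees)]

print(in_result)      # ['Michael', 'Will']
print(exists_result)  # ['Michael', 'Will']
```

Both return the same rows here, but only the EXISTS form is allowed to reference outer columns, which matches the Hive restriction described above.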
The INNER JOIN statement

Hive JOIN is used to combine rows from two or more tables together. Hive supports common JOIN operations such as those in an RDBMS, for example, JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. However, Hive only supports equal JOIN instead of unequal JOIN, because unequal JOIN is difficult to convert to MapReduce jobs.

The INNER JOIN in Hive uses the JOIN keyword, which returns rows meeting the JOIN conditions from both left and right tables. The INNER JOIN keyword can also be omitted by using comma-separated table names, since Hive 0.13.0. See the following examples showing various inner JOIN statements in Hive:
Prepare another table to join and load data:

jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_hr
.......> (
.......> name string,
.......> employee_id int,
.......> sin_number string,
.......> start_date date
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> STORED AS TEXTFILE;
No rows affected (1.732 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/Dayongd/employee_hr.txt'
.......> OVERWRITE INTO TABLE employee_hr;
No rows affected (0.635 seconds)

Perform an inner JOIN between two tables with equal JOIN conditions:

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+
3 rows selected (71.083 seconds)
The JOIN operation can be performed among more tables (three tables in this case), as follows:

jdbc:hive2://> SELECT emp.name, empi.employee_id, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name = emph.name
.......> JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name  | empi.employee_id  | emph.sin_number  |
+-----------+-------------------+------------------+
| Michael   | 100               | 547-968-091      |
| Will      | 101               | 527-948-090      |
| Lucy      | 103               | 577-928-094      |
+-----------+-------------------+------------------+
3 rows selected (67.933 seconds)

Self-join is a special JOIN where one table joins itself. When doing such joins, a different alias should be given to distinguish the same table:

jdbc:hive2://> SELECT emp.name
.......> FROM employee emp
.......> JOIN employee emp_b
.......> ON emp.name = emp_b.name;
+-----------+
| emp.name  |
+-----------+
| Michael   |
| Will      |
| Shelley   |
| Lucy      |
+-----------+
4 rows selected (59.891 seconds)
Implicit join is a JOIN operation without using the JOIN keyword. It is supported since Hive 0.13.0:

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp, employee_hr emph
.......> WHERE emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+
3 rows selected (47.241 seconds)
A JOIN operation that uses different columns in its join conditions will create an additional MapReduce job:

jdbc:hive2://> SELECT emp.name, empi.employee_id, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name = emph.name
.......> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name  | empi.employee_id  | emph.sin_number  |
+-----------+-------------------+------------------+
| Michael   | 100               | 547-968-091      |
| Will      | 101               | 527-948-090      |
| Lucy      | 103               | 577-928-094      |
+-----------+-------------------+------------------+
3 rows selected (49.785 seconds)
Note: If JOIN uses different columns in the join conditions, it will request additional job stages to complete the join. If the JOIN operation uses the same column in the join conditions, Hive will join on this condition using one stage.

When JOIN is performed between multiple tables, MapReduce jobs are created to process the data in HDFS. Each of the jobs is called a stage. Usually, it is suggested for JOIN statements to put the big table right at the end for better performance, as well as to avoid Out Of Memory (OOM) exceptions, because the last table in the sequence is streamed through the reducers whereas the others are buffered in the reducer by default. Also, a hint, such as /*+STREAMTABLE(table_name)*/, can be specified to tell Hive which table is streamed, as follows:
jdbc:hive2://> SELECT /*+STREAMTABLE(employee_hr)*/
.......> emp.name, empi.employee_id, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name = emph.name
.......> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
The OUTER JOIN and CROSS JOIN statements

Besides INNER JOIN, Hive also supports regular OUTER JOIN and FULL JOIN. The logic of such a JOIN is the same as that in an RDBMS. The following summarizes the differences of the common JOIN types (assume table_m has m rows and table_n has n rows):

table_m JOIN table_n: This returns all rows matched in both tables. Rows returned: m ∩ n.

table_m LEFT [OUTER] JOIN table_n: This returns all rows in the left table and matched rows in the right table. If there is no match in the right table, it returns NULL for the right table's columns. Rows returned: m.

table_m RIGHT [OUTER] JOIN table_n: This returns all rows in the right table and matched rows in the left table. If there is no match in the left table, it returns NULL for the left table's columns. Rows returned: n.

table_m FULL [OUTER] JOIN table_n: This returns all rows in both tables and matched rows in both tables. If there is no match in the left or right table, it returns NULL instead. Rows returned: m + n - m ∩ n.

table_m CROSS JOIN table_n: This returns all row combinations in both tables to produce a Cartesian product. Rows returned: m * n.
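The row-count column above can be checked with a small simulation. The following Python sketch (not Hive code) joins the sample employee (left) and employee_hr (right) name lists and counts the rows each JOIN type would produce:

```python
# A small simulation of the row counts in the JOIN summary above,
# joining on the name key of the sample employee (left, m = 4 rows)
# and employee_hr (right, n = 4 rows) tables.

left = ["Michael", "Will", "Shelley", "Lucy"]     # m = 4
right = ["Michael", "Will", "Steven", "Lucy"]     # n = 4

inner = [(l, r) for l in left for r in right if l == r]
left_join = [(l, l if l in right else None) for l in left]
right_join = [(r if r in left else None, r) for r in right]
full = left_join + [(None, r) for r in right if r not in left]
cross = [(l, r) for l in left for r in right]

print(len(inner))       # 3  rows: m ∩ n
print(len(left_join))   # 4  rows: m
print(len(right_join))  # 4  rows: n
print(len(full))        # 5  rows: m + n - m ∩ n
print(len(cross))       # 16 rows: m * n
```

The unmatched rows (Shelley on the left, Steven on the right) are exactly the ones padded with NULL in the OUTER JOIN outputs shown next.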
The following examples demonstrate OUTER JOIN:

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> LEFT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Shelley   | NULL             |
| Lucy      | 577-928-094      |
+-----------+------------------+
4 rows selected (39.637 seconds)

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> RIGHT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| NULL      | 647-968-598      |
| Lucy      | 577-928-094      |
+-----------+------------------+
4 rows selected (34.485 seconds)

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> FULL JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Lucy      | 577-928-094      |
| Michael   | 547-968-091      |
| Shelley   | NULL             |
| NULL      | 647-968-598      |
| Will      | 527-948-090      |
+-----------+------------------+
5 rows selected (64.251 seconds)
The CROSS JOIN statement, which is available since Hive 0.10.0, does not have a JOIN condition. The CROSS JOIN statement can also be written using JOIN without a condition, or with an always-true condition, such as 1 = 1. The following three ways of writing CROSS JOIN produce the same result set:
jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> CROSS JOIN employee_hr emph;

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph;

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON 1 = 1;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Michael   | 527-948-090      |
| Michael   | 647-968-598      |
| Michael   | 577-928-094      |
| Will      | 547-968-091      |
| Will      | 527-948-090      |
| Will      | 647-968-598      |
| Will      | 577-928-094      |
| Shelley   | 547-968-091      |
| Shelley   | 527-948-090      |
| Shelley   | 647-968-598      |
| Shelley   | 577-928-094      |
| Lucy      | 547-968-091      |
| Lucy      | 527-948-090      |
| Lucy      | 647-968-598      |
| Lucy      | 577-928-094      |
+-----------+------------------+
16 rows selected (34.924 seconds)
In addition, JOIN always happens before WHERE. If possible, push filter conditions into the JOIN conditions rather than into the WHERE conditions, so that the result set is filtered immediately after the JOIN. What's more, JOIN is NOT commutative! It is always left-associative, no matter whether it is a LEFT JOIN or a RIGHT JOIN.

Although Hive does not support unequal JOIN explicitly, there are workarounds using CROSS JOIN and WHERE conditions, as mentioned in the following example:
jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name <> emph.name;
Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 1:77 Both left and right aliases encountered in JOIN 'name' (state=42000,code=10017)

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> CROSS JOIN employee_hr emph WHERE emp.name <> emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 527-948-090      |
| Michael   | 647-968-598      |
| Michael   | 577-928-094      |
| Will      | 547-968-091      |
| Will      | 647-968-598      |
| Will      | 577-928-094      |
| Shelley   | 547-968-091      |
| Shelley   | 527-948-090      |
| Shelley   | 647-968-598      |
| Shelley   | 577-928-094      |
| Lucy      | 547-968-091      |
| Lucy      | 527-948-090      |
| Lucy      | 647-968-598      |
+-----------+------------------+
13 rows selected (35.016 seconds)
Special JOIN – MAPJOIN

The MAPJOIN statement means performing the JOIN operation only with maps, without a reduce job. The MAPJOIN statement reads all the data from the small table into memory and broadcasts it to all maps. During the map phase, the JOIN operation is performed by comparing each row of data in the big table with the small table against the join conditions. Because there is no reduce needed, JOIN performance is improved. When the hive.auto.convert.join setting is set to true, Hive automatically converts the JOIN to MAPJOIN at runtime if possible, instead of checking the map join hint. In addition, MAPJOIN can be used for unequal joins to improve performance, since both MAPJOIN and WHERE are performed in the map phase. The following is an example of MAPJOIN that is enabled by a query hint:
jdbc:hive2://> SELECT /*+MAPJOIN(employee)*/ emp.name, emph.sin_number
.......> FROM employee emp
.......> CROSS JOIN employee_hr emph WHERE emp.name <> emph.name;

The MAPJOIN operation does not support the following:

The use of MAPJOIN after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY
The use of MAPJOIN before UNION, JOIN, and another MAPJOIN
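Conceptually, a map-side join is a build-and-probe hash join with no shuffle phase. The following Python sketch (an illustration, not Hive internals) loads the small table into a hash table once and streams the big table against it, which is what each mapper does after receiving the broadcast copy:

```python
# A conceptual sketch of MAPJOIN: the small table is loaded into an
# in-memory hash table (the broadcast copy), and the big table is
# streamed through, probing the hash table row by row.
# No shuffle/reduce phase is needed.

small_table = [("Michael", "547-968-091"), ("Will", "527-948-090"),
               ("Steven", "647-968-598"), ("Lucy", "577-928-094")]
big_table = ["Michael", "Will", "Shelley", "Lucy"]

# Build phase: hash the small table once.
sin_by_name = dict(small_table)

# Probe phase: each "mapper" streams big-table rows and joins in place.
joined = [(name, sin_by_name[name])
          for name in big_table if name in sin_by_name]

print(joined)
# [('Michael', '547-968-091'), ('Will', '527-948-090'), ('Lucy', '577-928-094')]
```

This also explains the memory constraint: the small table must fit in each mapper's memory, which is why Hive only auto-converts joins whose small side is below a size threshold.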
The bucket map join is a special type of MAPJOIN that uses bucket columns (the column specified by CLUSTERED BY in the CREATE TABLE statement) as the join condition. Instead of fetching the whole table, as done by the regular map join, the bucket map join only fetches the required bucket data. To enable the bucket map join, we need to set hive.optimize.bucketmapjoin = true and make sure the bucket numbers are multiples of each other. If both tables joined are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all the small tables in memory. The following additional settings are needed to enable this behavior:

SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
The LEFT SEMI JOIN statement is also a type of MAPJOIN. Before Hive supported IN/EXISTS, LEFT SEMI JOIN was used to implement such a request, as shown in the following example. The restriction of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition, but not in the WHERE or SELECT clauses.
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> WHERE EXISTS
.......> (SELECT * FROM employee_id b
.......> WHERE a.name = b.name);
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> LEFT SEMI JOIN employee_id b
.......> ON a.name = b.name;
+----------+
| a.name   |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (35.027 seconds)
Set operation – UNION ALL

To operate on result sets vertically, Hive only supports UNION ALL right now, and the result set of UNION ALL keeps duplicates, if any. Before Hive 0.13.0, UNION ALL could only be used in a subquery. Since Hive 0.13.0, UNION ALL can also be used in top-level queries. The following are examples of the UNION ALL statement:
Check the name column in the employee_hr and employee tables:
jdbc:hive2://> SELECT name FROM employee_hr;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Steven   |
| Lucy     |
+----------+
4 rows selected (0.116 seconds)
jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (0.049 seconds)
Use UNION ALL on the name column from both tables, keeping duplicates:
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> UNION ALL
.......> SELECT b.name
.......> FROM employee_hr b;
+-----------+
| _u1.name  |
+-----------+
| Michael   |
| Will      |
| Shelley   |
| Lucy      |
| Michael   |
| Will      |
| Steven    |
| Lucy      |
+-----------+
8 rows selected (39.93 seconds)
For other set operations supported by RDBMS, such as UNION, INTERSECT, and MINUS, we can use SELECT with WHERE conditions to implement them as follows:
Implement UNION between two tables without duplicates:
jdbc:hive2://> SELECT DISTINCT name
.......> FROM
.......> (
.......> SELECT a.name AS name
.......> FROM employee a
.......> UNION ALL
.......> SELECT b.name AS name
.......> FROM employee_hr b
.......> ) union_set;
+----------+
| name     |
+----------+
| Lucy     |
| Michael  |
| Shelley  |
| Steven   |
| Will     |
+----------+
5 rows selected (100.366 seconds)
Note
The subquery alias (such as union_set in this example) must be given to avoid a Hive syntax error.
Implement INTERSECT between the employee and employee_hr tables using JOIN:
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> JOIN employee_hr b
.......> ON a.name = b.name;
+----------+
| a.name   |
+----------+
| Michael  |
| Will     |
| Lucy     |
+----------+
3 rows selected (44.862 seconds)
Implement MINUS between the employee and employee_hr tables using OUTER JOIN:
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> LEFT JOIN employee_hr b
.......> ON a.name = b.name
.......> WHERE b.name IS NULL;
+----------+
| a.name   |
+----------+
| Shelley  |
+----------+
1 row selected (36.841 seconds)
Summary

In this chapter, you learned to use SELECT statements to discover the data you need. Then, we introduced the Hive operations that link different datasets in the vertical or horizontal direction using JOIN or UNION ALL. After going through this chapter, we should be able to use the SELECT statement with different WHERE conditions, LIMIT, DISTINCT, and complex subqueries. We should be able to understand and use the different types of JOIN statements to link datasets horizontally, and UNION ALL to combine datasets vertically.
In the next chapter, we will talk about the details of exchanging, ordering, and transforming data, as well as transactions in Hive.
Chapter 5. Data Manipulation

The ability to manipulate data is a critical capability in big data analysis. Manipulating data is the process of exchanging, moving, sorting, and transforming data. This technique is used in many situations, such as cleaning data, searching for patterns, identifying trends, and so on. Hive offers various query statements, keywords, operators, and functions to carry out data manipulation.
In this chapter, we will cover the following topics:
- Data exchange using LOAD, INSERT, IMPORT, and EXPORT
- Order and sort
- Operators and functions
- Transactions
Data exchange – LOAD

To move data in Hive, it uses the LOAD keyword. Move here means the original data is moved to the target table/partition and no longer exists in the original place. The following is an example of how to move data to a Hive table or partition from local or HDFS files. The LOCAL keyword specifies where the files are located on the host. If the LOCAL keyword is not specified, the files are loaded from the full Uniform Resource Identifier (URI) specified after INPATH, or from the value of the fs.default.name Hive property by default. The path after INPATH can be a relative path or an absolute path. The path either points to a file or a folder (all files in the folder) to be loaded, but subfolders are not allowed in the specified path. If the data is loaded into a partitioned table, the partition columns must be specified. The OVERWRITE keyword is used to decide whether to append to or replace the existing data in the target table/partition.
The following are examples of loading files into Hive tables:
Load local data to the Hive table:
jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee_hr.txt'
.......> OVERWRITE INTO TABLE employee_hr;
No rows affected (0.436 seconds)
Load local data to the Hive partition table:
jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee.txt'
.......> OVERWRITE INTO TABLE employee_partitioned
.......> PARTITION (year=2014, month=12);
No rows affected (0.772 seconds)
Load HDFS data to the Hive table using the default system path:
jdbc:hive2://> LOAD DATA INPATH
.......> '/user/dayongd/employee/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (0.453 seconds)
Load HDFS data to the Hive table with the full URI:
jdbc:hive2://> LOAD DATA INPATH
.......> 'hdfs://[dfs_host]:8020/user/dayongd/employee/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (0.297 seconds)
Data exchange – INSERT

To extract data from Hive tables/partitions, we can use the INSERT keyword. Like RDBMS, Hive supports inserting data by selecting data from other tables. This is a very common way to populate a table from existing data. The basic INSERT statement has the same syntax as a relational database's INSERT. However, Hive has improved its INSERT statement by supporting OVERWRITE, multiple INSERT, dynamic partition INSERT, as well as INSERT to files. The following are a few examples:
The following is a regular INSERT from a SELECT statement:
-- Check the target table, which is empty.
jdbc:hive2://> SELECT name, work_place, sex_age
.......> FROM employee;
+---------------+---------------------+------------------+
| employee.name | employee.work_place | employee.sex_age |
+---------------+---------------------+------------------+
+---------------+---------------------+------------------+
No rows selected (0.115 seconds)
-- Populate data from SELECT
jdbc:hive2://> INSERT INTO TABLE employee
.......> SELECT * FROM ctas_employee;
No rows affected (31.701 seconds)
-- Verify the data loaded
jdbc:hive2://> SELECT name, work_place, sex_age FROM employee;
+---------------+------------------------+---------------------------+
| employee.name | employee.work_place    | employee.sex_age          |
+---------------+------------------------+---------------------------+
| Michael       | ["Montreal","Toronto"] | {"sex":"Male","age":30}   |
| Will          | ["Montreal"]           | {"sex":"Male","age":35}   |
| Shelley       | ["New York"]           | {"sex":"Female","age":27} |
| Lucy          | ["Vancouver"]          | {"sex":"Female","age":57} |
+---------------+------------------------+---------------------------+
4 rows selected (0.12 seconds)
Insert data from the CTE statement:
jdbc:hive2://> WITH a AS (SELECT * FROM ctas_employee)
.......> FROM a
.......> INSERT OVERWRITE TABLE employee
.......> SELECT *;
No rows affected (30.1 seconds)
Run multiple INSERT statements by scanning the source table only once:
jdbc:hive2://> FROM ctas_employee
.......> INSERT OVERWRITE TABLE employee
.......> SELECT *
.......> INSERT OVERWRITE TABLE employee_internal
.......> SELECT *;
No rows affected (27.919 seconds)
Note
The INSERT OVERWRITE statement will replace the data in the target table/partition, while INSERT INTO will append data.
When inserting data into partitions, we need to specify the partition columns. Instead of specifying static values for static partitions, Hive also supports giving partition values dynamically. Dynamic partitions are useful when the data volume is large and we don't know what the partition values will be, for example, when the date is dynamically used as a partition column.
Dynamic partition is not enabled by default. We need to set the following properties to make it work:
jdbc:hive2://> SET hive.exec.dynamic.partition=true;
No rows affected (0.002 seconds)
By default, the user must specify at least one static partition column. This is to avoid accidentally overwriting partitions. To disable this restriction, we can set the partition mode to nonstrict from the default strict mode before inserting into dynamic partitions, as follows:
jdbc:hive2://> SET hive.exec.dynamic.partition.mode=nonstrict;
No rows affected (0.002 seconds)
jdbc:hive2://> INSERT INTO TABLE employee_partitioned
.......> PARTITION(year, month)
.......> SELECT name, array('Toronto') as work_place,
.......> named_struct("sex","Male","age",30) as sex_age,
.......> map("Python",90) as skills_score,
.......> map("R&D",array('Developer')) as depart_title,
.......> year(start_date) as year, month(start_date) as month
.......> FROM employee_hr eh
.......> WHERE eh.employee_id = 102;
No rows affected (29.024 seconds)
Note
Complex type constructors are used in the preceding example to assign a constant value to a complex data type column.
The Hive INSERT to files statement is the opposite operation of LOAD. It extracts data from SELECT statements to local or HDFS files. However, it only supports the OVERWRITE keyword, not INTO. This means we cannot append the extracted data to existing files. By default, the columns are separated by ^A and the rows are separated by newlines. Since Hive 0.11.0, row separators can be specified. The following are a few examples of inserting data to files:
We can insert to local files with default row separators. In some recent versions of Hadoop, the local directory path only works for a directory level less than two. We may need to set hive.insert.into.multilevel.dirs=true to get this fixed:
jdbc:hive2://> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output1'
.......> SELECT * FROM employee;
No rows affected (30.859 seconds)
Note
By default, many partial files could be created by the reducers when doing INSERT. To merge them into one, we can use HDFS commands, as shown in the following example:
hdfs dfs -getmerge hdfs://<host_name>:8020/user/dayongd/output /tmp/test
Insert to local files with specified row separators:
jdbc:hive2://> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output2'
.......> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
.......> SELECT * FROM employee;
No rows affected (31.937 seconds)
-- Verify the separator
vi /tmp/output2/000000_0
Michael,Montreal^BToronto,Male^B30,DB^C80,Product^CDeveloper^DLead
Will,Montreal,Male^B35,Perl^C85,Product^CLead^BTest^CLead
Shelley,New York,Female^B27,Python^C80,Test^CLead^BCOE^CArchitect
Lucy,Vancouver,Female^B57,Sales^C89^BHR^C94,Sales^CLead
Fire multiple INSERT statements from the same table SELECT statement:
jdbc:hive2://> FROM employee
.......> INSERT OVERWRITE DIRECTORY '/user/dayongd/output'
.......> SELECT *
.......> INSERT OVERWRITE DIRECTORY '/user/dayongd/output1'
.......> SELECT *;
No rows affected (25.4 seconds)
Note
Besides the Hive INSERT statement, Hive and HDFS shell commands can also be used to extract data to local or remote files, with both append and overwrite supported. The hive -e 'quoted_hql_string' or hive -f <hql_filename> commands can execute a Hive query statement or query file. Linux redirection operators and piping can be used with these commands to redirect result sets. The following are a few examples:
Append to local files:
$ hive -e 'select * from employee' >> test
Overwrite to local files:
$ hive -e 'select * from employee' > test
Append to HDFS files:
$ hive -e 'select * from employee' | hdfs dfs -appendToFile - /user/dayongd/output2/test
Overwrite to HDFS files:
$ hive -e 'select * from employee' | hdfs dfs -put -f - /user/dayongd/output2/test
Data exchange – EXPORT and IMPORT

When working with Hive, we sometimes need to migrate data among different environments, or we may need to back up some data. Since Hive 0.8.0, EXPORT and IMPORT statements are available to support importing and exporting data in HDFS for data migration or backup/restore purposes.
The EXPORT statement will export both data and metadata from a table or partition. Metadata is exported in a file called _metadata. Data is exported in a subdirectory called data:
jdbc:hive2://> EXPORT TABLE employee TO '/user/dayongd/output3';
No rows affected (0.19 seconds)
After EXPORT, we can manually copy the exported files to other Hive instances or use the Hadoop distcp command to copy them to other HDFS clusters. Then, we can import the data in the following ways:
Import data to a table with the same name. It throws an error if the table exists:
jdbc:hive2://> IMPORT FROM '/user/dayongd/output3';
Error: Error while compiling statement: FAILED: SemanticException
[Error 10119]: Table exists and contains data files
(state=42000,code=10119)
Import data to a new table:
jdbc:hive2://> IMPORT TABLE employee_imported FROM
.......> '/user/dayongd/output3';
No rows affected (0.788 seconds)
Import data to an external table, where the LOCATION property is optional:
jdbc:hive2://> IMPORT EXTERNAL TABLE employee_imported_external
.......> FROM '/user/dayongd/output3'
.......> LOCATION '/user/dayongd/output4';
No rows affected (0.256 seconds)
Export and import partitions:
jdbc:hive2://> EXPORT TABLE employee_partitioned PARTITION
.......> (year=2014, month=11) TO '/user/dayongd/output5';
No rows affected (0.247 seconds)
jdbc:hive2://> IMPORT TABLE employee_partitioned_imported
.......> FROM '/user/dayongd/output5';
No rows affected (0.14 seconds)
ORDER and SORT

Another aspect of manipulating data in Hive is properly ordering or sorting the data or result sets to clearly identify the important facts, such as the top N values, maximum, minimum, and so on.
The following keywords are used in Hive to order and sort data:
ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. A sorted order is maintained across all of the output from every reducer. It performs the global sort using only one reducer, so it takes longer to return the result. Usage with LIMIT is strongly recommended for ORDER BY. When hive.mapred.mode=strict is set (by default, hive.mapred.mode=nonstrict) and we do not specify LIMIT, an exception is thrown. This can be used as follows:
jdbc:hive2://> SELECT name FROM employee ORDER BY NAME DESC;
+----------+
| name     |
+----------+
| Will     |
| Shelley  |
| Michael  |
| Lucy     |
+----------+
4 rows selected (57.057 seconds)
SORT BY (ASC|DESC): This indicates which columns to sort by when ordering the reducer input records. This means it completes sorting before sending data to the reducers. The SORT BY statement does not perform a global sort and only makes sure data is locally sorted in each reducer, unless we set mapred.reduce.tasks=1, in which case it is equal to the result of ORDER BY. It can be used as follows:
-- Use more than 1 reducer
jdbc:hive2://> SET mapred.reduce.tasks=2;
No rows affected (0.001 seconds)
jdbc:hive2://> SELECT name FROM employee SORT BY NAME DESC;
+----------+
| name     |
+----------+
| Shelley  |
| Michael  |
| Lucy     |
| Will     |
+----------+
4 rows selected (54.386 seconds)
-- Use only 1 reducer
jdbc:hive2://> SET mapred.reduce.tasks=1;
No rows affected (0.002 seconds)
jdbc:hive2://> SELECT name FROM employee SORT BY NAME DESC;
+----------+
| name     |
+----------+
| Will     |
| Shelley  |
| Michael  |
| Lucy     |
+----------+
4 rows selected (46.03 seconds)
DISTRIBUTE BY: Rows with matching column values will be partitioned to the same reducer. When used alone, it does not guarantee sorted input to the reducers. The DISTRIBUTE BY statement is similar to GROUP BY in RDBMS in terms of deciding which reducer to distribute the mapper output to. When used with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement, and the columns used to distribute must appear in the select column list. It can be used as follows:
jdbc:hive2://> SELECT name
.......> FROM employee_hr DISTRIBUTE BY employee_id;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10004]: Line 1:44 Invalid table alias or column reference
'employee_id': (possible column names are: name)
(state=42000,code=10004)
jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr DISTRIBUTE BY employee_id;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Steven   | 102          |
| Will     | 101          |
| Michael  | 100          |
+----------+--------------+
4 rows selected (38.92 seconds)
-- Used with SORT BY
jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr
.......> DISTRIBUTE BY employee_id SORT BY name;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Michael  | 100          |
| Steven   | 102          |
| Will     | 101          |
+----------+--------------+
4 rows selected (38.01 seconds)
CLUSTER BY: This is a shorthand operator to perform DISTRIBUTE BY and SORT BY operations on the same group of columns, and the data is sorted locally in each reducer. The CLUSTER BY statement does not support ASC or DESC yet. Compared to ORDER BY, which is globally sorted, the CLUSTER BY operation is sorted within each distributed group. To fully utilize all the available reducers when doing a global sort, we can do CLUSTER BY first and then ORDER BY. This can be used as follows:
jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr CLUSTER BY name;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Michael  | 100          |
| Steven   | 102          |
| Will     | 101          |
+----------+--------------+
4 rows selected (39.791 seconds)
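The CLUSTER BY-then-ORDER BY tip mentioned above can be sketched as a nested query. This is an illustrative pattern, not one of the book's examples:

```sql
-- Distribute and locally sort across all available reducers first,
-- then let ORDER BY perform the final, cheaper global merge
SELECT name, employee_id
FROM (
  SELECT name, employee_id
  FROM employee_hr
  CLUSTER BY name
) clustered
ORDER BY name;
```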
The difference between ORDER BY and CLUSTER BY can be seen in the following diagram:
Operators and functions

To further manipulate data, we can also use expressions, operators, and functions in Hive to transform data. The Hive wiki (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) offers specifications for each expression and function, so we do not repeat all of them here, except for a few important usages and tips in this chapter.
Hive has defined relational operators, arithmetic operators, logical operators, complex type constructors, and complex type operators. The relational, arithmetic, and logical operators are similar to the standard operators in SQL/Java, so we do not repeat them in this chapter. The operators on complex data types were already introduced in the Understanding Hive data types section of Chapter 3, Data Definition and Description, as well as in the dynamic partition insert example in this chapter.
The functions in Hive are categorized as follows:
- Mathematical functions: These functions are mainly used to perform mathematical calculations, such as RAND() and E().
- Collection functions: These functions are used to find the size, keys, and values of complex types, such as SIZE(Array<T>).
- Type conversion functions: These are mainly the CAST and BINARY functions to convert one type to another.
- Date functions: These functions are used to perform date-related calculations, such as YEAR(string date) and MONTH(string date).
- Conditional functions: These functions are used to check specific conditions with a defined value returned, such as COALESCE, IF, and CASE WHEN.
- String functions: These functions are used to perform string-related operations, such as UPPER(string A) and TRIM(string A).
- Aggregate functions: These functions are used to perform aggregation (introduced in more detail in the next chapter), such as SUM() and COUNT(*).
- Table-generating functions: These functions transform a single input row into multiple output rows, such as EXPLODE(MAP) and JSON_TUPLE(jsonString, k1, k2, …).
- Customized functions: These functions are created with Java code as extensions for Hive. They are introduced in Chapter 8, Extensibility Considerations.
To list Hive built-in functions/UDFs, we can use the following commands in the Hive CLI:
SHOW FUNCTIONS; -- List all functions
DESCRIBE FUNCTION <function_name>; -- Detail for the specified function
DESCRIBE FUNCTION EXTENDED <function_name>; -- Even more details
The following are a few examples and tips for using these functions:
Complex data type function tips: The SIZE function is used to calculate the size of a MAP, ARRAY, or nested MAP/ARRAY. It returns -1 if the size is unknown. It can be used as follows:
jdbc:hive2://> SELECT work_place, skills_score, depart_title
.......> FROM employee;
+------------------------+----------------------+---------------------------------------+
| work_place             | skills_score         | depart_title                          |
+------------------------+----------------------+---------------------------------------+
| ["Montreal","Toronto"] | {"DB":80}            | {"Product":["Developer","Lead"]}      |
| ["Montreal"]           | {"Perl":85}          | {"Product":["Lead"],"Test":["Lead"]}  |
| ["New York"]           | {"Python":80}        | {"Test":["Lead"],"COE":["Architect"]} |
| ["Vancouver"]          | {"Sales":89,"HR":94} | {"Sales":["Lead"]}                    |
+------------------------+----------------------+---------------------------------------+
4 rows selected (0.084 seconds)
jdbc:hive2://> SELECT SIZE(work_place) AS array_size,
.......> SIZE(skills_score) AS map_size,
.......> SIZE(depart_title) AS complex_size,
.......> SIZE(depart_title["Product"]) AS nest_size
.......> FROM employee;
+-------------+-----------+---------------+------------+
| array_size  | map_size  | complex_size  | nest_size  |
+-------------+-----------+---------------+------------+
| 2           | 1         | 1             | 2          |
| 1           | 1         | 2             | 1          |
| 1           | 1         | 2             | -1         |
| 1           | 2         | 1             | -1         |
+-------------+-----------+---------------+------------+
4 rows selected (0.062 seconds)
The ARRAY_CONTAINS statement checks whether the array contains some value, returning TRUE or FALSE. The SORT_ARRAY statement sorts the array in ascending order. These can be used as follows:
jdbc:hive2://> SELECT ARRAY_CONTAINS(work_place, 'Toronto')
.......> AS is_Toronto,
.......> SORT_ARRAY(work_place) AS sorted_array
.......> FROM employee;
+-------------+-------------------------+
| is_toronto  | sorted_array            |
+-------------+-------------------------+
| true        | ["Montreal","Toronto"]  |
| false       | ["Montreal"]            |
| false       | ["New York"]            |
| false       | ["Vancouver"]           |
+-------------+-------------------------+
4 rows selected (0.059 seconds)
Date function tips: The FROM_UNIXTIME(UNIX_TIMESTAMP()) statement performs the same function as SYSDATE in Oracle. It dynamically returns the current date and time on the Hive server, as follows:
jdbc:hive2://> SELECT
.......> FROM_UNIXTIME(UNIX_TIMESTAMP()) AS current_time
.......> FROM employee LIMIT 1;
+----------------------+
| current_time         |
+----------------------+
| 2014-11-15 19:28:29  |
+----------------------+
1 row selected (0.047 seconds)
The UNIX_TIMESTAMP() statement can be used to compare two dates, or can be used after ORDER BY to properly order different string representations of a date value, such as ORDER BY UNIX_TIMESTAMP(string_date, 'dd-MM-yyyy'). It can be used as follows:
-- To compare the difference between two dates.
jdbc:hive2://> SELECT (UNIX_TIMESTAMP('2015-01-21 18:00:00')
.......> - UNIX_TIMESTAMP('2015-01-10 11:00:00'))/60/60/24
.......> AS daydiff FROM employee LIMIT 1;
+---------------------+
| daydiff             |
+---------------------+
| 11.291666666666666  |
+---------------------+
1 row selected (0.093 seconds)
The TO_DATE statement removes the hours, minutes, and seconds from a date. This is useful when we need to check whether the value of a date-time column is within a date range, such as WHERE TO_DATE(update_datetime) BETWEEN '2014-11-01' AND '2014-11-31'. It can be used as follows:
jdbc:hive2://> SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()))
.......> AS current_date FROM employee LIMIT 1;
+---------------+
| current_date  |
+---------------+
| 2014-11-15    |
+---------------+
1 row selected (0.153 seconds)
CASE for different data types: Before Hive 0.13.0, the data types after THEN or ELSE needed to be the same. Otherwise, an exception was thrown, such as The expression after ELSE should have the same type as those after THEN: "bigint" is expected but "int" is found. The workaround is to use IF. In Hive 0.13.0, this gets fixed, as shown here:
jdbc:hive2://> SELECT
.......> CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 0 END
.......> AS case_result FROM employee LIMIT 1;
+--------------+
| case_result  |
+--------------+
| 0            |
+--------------+
1 row selected (0.063 seconds)
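For versions before Hive 0.13.0, one way to apply the workaround above is to force both branches to the same type explicitly, either with CAST or by writing both literals as strings. This is an illustrative sketch, not one of the book's examples:

```sql
-- Pre-0.13.0 workaround: make the THEN and ELSE branches one type
SELECT CASE WHEN 1 IS NULL THEN 'TRUE'
            ELSE CAST(0 AS STRING) END AS case_result
FROM employee LIMIT 1;
```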
Parser and search tips: The LATERAL VIEW statement is used with user-defined table-generating functions, such as EXPLODE(), to flatten a map or array type column. The explode function can be used on both ARRAY and MAP with LATERAL VIEW. If even one of the columns exploded is NULL, the whole row is filtered out, such as the row for Steven in the following example. To avoid this, LATERAL VIEW OUTER can be used since Hive 0.12.0, as follows:
-- Prepare data
jdbc:hive2://> INSERT INTO TABLE employee
.......> SELECT 'Steven' AS name, array(null) as work_place,
.......> named_struct("sex","Male","age",30) as sex_age,
.......> map("Python",90) as skills_score,
.......> map("R&D",array('Developer')) as depart_title
.......> FROM employee LIMIT 1;
No rows affected (28.187 seconds)
jdbc:hive2://> SELECT name, work_place, skills_score
.......> FROM employee;
+----------+-------------------------+-----------------------+
| name     | work_place              | skills_score          |
+----------+-------------------------+-----------------------+
| Michael  | ["Montreal","Toronto"]  | {"DB":80}             |
| Will     | ["Montreal"]            | {"Perl":85}           |
| Shelley  | ["New York"]            | {"Python":80}         |
| Lucy     | ["Vancouver"]           | {"Sales":89,"HR":94}  |
| Steven   | NULL                    | {"Python":90}         |
+----------+-------------------------+-----------------------+
5 rows selected (0.053 seconds)
-- LATERAL VIEW ignores the row when EXPLODE returns NULL
jdbc:hive2://> SELECT name, workplace, skills, score
.......> FROM employee
.......> LATERAL VIEW explode(work_place) wp AS workplace
.......> LATERAL VIEW explode(skills_score) ss
.......> AS skills, score;
+----------+------------+---------+--------+
| name     | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
+----------+------------+---------+--------+
6 rows selected (24.733 seconds)
-- LATERAL VIEW OUTER keeps the row when EXPLODE returns NULL
jdbc:hive2://> SELECT name, workplace, skills, score
.......> FROM employee
.......> LATERAL VIEW OUTER explode(work_place) wp
.......> AS workplace
.......> LATERAL VIEW explode(skills_score) ss
.......> AS skills, score;
+----------+------------+---------+--------+
| name     | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
| Steven   | None       | Python  | 90     |
+----------+------------+---------+--------+
7 rows selected (24.573 seconds)
The REVERSE statement can be used to reverse the order of the letters in a string. The SPLIT statement can be used to tokenize a string using a specified separator. The following is an example of using them together to get the filename from a Linux path:
jdbc:hive2://> SELECT
.......> reverse(split(reverse('/home/user/employee.txt'),'/')[0])
.......> AS linux_file_name FROM employee LIMIT 1;
+------------------+
| linux_file_name  |
+------------------+
| employee.txt     |
+------------------+
1 row selected (0.1 seconds)
Whereas EXPLODE outputs each element in an array or map as separate rows, collect_set and collect_list do the opposite by returning a collection with elements gathered from each row. The collect_set statement removes duplicates from the result, but collect_list does not. This is shown here:
jdbc:hive2://> SELECT collect_set(work_place[0])
.......> AS flat_workplace0 FROM employee;
+--------------------------------------+
| flat_workplace0                      |
+--------------------------------------+
| ["Vancouver","Montreal","New York"]  |
+--------------------------------------+
1 row selected (43.455 seconds)
jdbc:hive2://> SELECT collect_list(work_place[0])
.......> AS flat_workplace0 FROM employee;
+------------------------------------------------+
| flat_workplace0                                |
+------------------------------------------------+
| ["Montreal","Montreal","New York","Vancouver"] |
+------------------------------------------------+
1 row selected (45.488 seconds)
Virtual columns: Virtual columns are a special function type of column in Hive. Right now, Hive offers two virtual columns: INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE. The INPUT__FILE__NAME column is the input file's name for a mapper task. The BLOCK__OFFSET__INSIDE__FILE column is the current global file position, or the current block's file offset if the file is compressed. The following are examples of using virtual columns to find out where the data is physically located in HDFS, especially for bucketed and partitioned tables:
jdbc:hive2://> SELECT INPUT__FILE__NAME,
.......> BLOCK__OFFSET__INSIDE__FILE AS OFFSIDE
.......> FROM employee_id_buckets;
+---------------------------------------------------------+----------+
| input__file__name                                       | offside  |
+---------------------------------------------------------+----------+
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 0        |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 55       |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 120      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 175      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 240      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 295      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 360      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 415      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 480      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 535      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 592      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 657      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 712      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 769      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 834      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 0        |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 57       |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 122      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 177      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 234      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 291      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 348      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 405      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 462      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 517      |
+---------------------------------------------------------+----------+
25 rows selected (0.073 seconds)
jdbc:hive2://> SELECT INPUT__FILE__NAME FROM employee_partitioned;
+---------------------------------------------------------------------------+
| input__file__name                                                         |
+---------------------------------------------------------------------------+
| hdfs://warehouse_URI/employee_partitioned/year=2010/month=1/000000_0      |
| hdfs://warehouse_URI/employee_partitioned/year=2012/month=11/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
+---------------------------------------------------------------------------+
10 rows selected (0.47 seconds)
Functions not mentioned in the Hive wiki: The following are a few such functions:
-- Functions to check for null values
jdbc:hive2://> SELECT work_place, isnull(work_place) is_null,
.......> isnotnull(work_place) is_not_null FROM employee;
+-------------------------+----------+--------------+
| work_place              | is_null  | is_not_null  |
+-------------------------+----------+--------------+
| ["Montreal","Toronto"]  | false    | true         |
| ["Montreal"]            | false    | true         |
| ["New York"]            | false    | true         |
| ["Vancouver"]           | false    | true         |
| NULL                    | true     | false        |
+-------------------------+----------+--------------+
5 rows selected (0.058 seconds)
-- assert_true, throws an exception if 'condition' is not true
jdbc:hive2://> SELECT assert_true(work_place IS NULL)
.......> FROM employee;
Error: java.io.IOException:
org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE():
assertion failed. (state=,code=0)
-- elt(n, str1, str2, ...), returns the n-th string
jdbc:hive2://> SELECT elt(2, 'New York', 'Montreal', 'Toronto')
.......> FROM employee LIMIT 1;
+-----------+
| _c0       |
+-----------+
| Montreal  |
+-----------+
1 row selected (0.055 seconds)
-- Return the name of the current database, since Hive 0.13.0
jdbc:hive2://> SELECT current_database();
+----------+
| _c0      |
+----------+
| default  |
+----------+
1 row selected (0.057 seconds)
Transactions

Before Hive version 0.13.0, Hive did not support row-level transactions. As a result, there was no way to update, insert, or delete rows of data, and data overwrite could only happen on tables or partitions. This made Hive very difficult to use when dealing with concurrent read/write and data-cleaning use cases.
Since Hive version 0.13.0, Hive fully supports row-level transactions by offering full Atomicity, Consistency, Isolation, and Durability (ACID) in Hive. For now, all transactions are autocommitted, and they only support data in the Optimized Row Columnar (ORC) file format (available since Hive 0.11.0) and in bucketed tables.
The following configuration parameters must be set appropriately to turn on transaction support in Hive:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
The SHOW TRANSACTIONS command was added in Hive 0.13.0 to show the currently open and aborted transactions in the system:
jdbc:hive2://> SHOW TRANSACTIONS;
+-----------------+--------------------+-------+-----------+
| txnid           | state              | user  | host      |
+-----------------+--------------------+-------+-----------+
| Transaction ID  | Transaction State  | User  | Hostname  |
+-----------------+--------------------+-------+-----------+
1 row selected (15.209 seconds)
Since Hive 0.14.0, the INSERT ... VALUES, UPDATE, and DELETE commands have been added to operate on rows, with the following syntax:

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)]
VALUES values_row [, values_row …];

UPDATE tablename SET column = value [, column = value …] [WHERE expression]

DELETE FROM tablename [WHERE expression]
Summary

In this chapter, we covered how to exchange data between Hive and files using the LOAD, INSERT, IMPORT, and EXPORT keywords. Then, we introduced the different Hive ordering and sorting options. We also covered some commonly used tips for Hive functions. Finally, we provided an overview of the row-level transactions newly supported since Hive 0.13.0. After going through this chapter, we should be able to import or export data to Hive, and we should be experienced in using the different types of ordering and sorting keywords, Hive functions, and transactions.

In the next chapter, we'll look at the different ways of carrying out data aggregation and sampling in Hive.
Chapter 6. Data Aggregation and Sampling

This chapter is about how to aggregate and sample data in Hive. It first covers the usage of several aggregation functions, analytic functions working with GROUP BY and PARTITION BY, and windowing clauses. Then, it introduces the different ways of sampling data in Hive.

In this chapter, we will cover the following topics:

Basic aggregation
Advanced aggregation
Aggregation condition
Analytic functions
Sampling
Basic aggregation – GROUP BY

Data aggregation is any process of gathering and expressing data in a summary form to get more information about particular groups based on specific conditions. Hive offers several built-in aggregate functions, such as MAX, MIN, AVG, and so on. Hive also supports advanced aggregation using GROUPING SETS, ROLLUP, CUBE, analytic functions, and windowing.

The Hive basic built-in aggregate functions are usually used with the GROUP BY clause. If no GROUP BY clause is specified, they aggregate over the whole table by default. Besides aggregate functions, all other columns that are selected must also be included in the GROUP BY clause. The following are a few examples using the built-in aggregate functions:

Aggregation without GROUP BY columns:
jdbc:hive2://> SELECT count(*) AS row_cnt FROM employee;
+----------+
| row_cnt  |
+----------+
| 5        |
+----------+
1 row selected (60.709 seconds)
Aggregation with GROUP BY columns:

jdbc:hive2://> SELECT sex_age.sex, count(*) AS row_cnt
.......> FROM employee
.......> GROUP BY sex_age.sex;
+--------------+----------+
| sex_age.sex  | row_cnt  |
+--------------+----------+
| Female       | 2        |
| Male         | 3        |
+--------------+----------+
2 rows selected (100.565 seconds)

-- The column name selected is not a GROUP BY column
jdbc:hive2://> SELECT name, sex_age.sex, count(*) AS row_cnt
.......> FROM employee GROUP BY sex_age.sex;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10025]: Line 1:7 Expression not in GROUP BY key 'name'
(state=42000,code=10025)
If we have to select columns that are not GROUP BY columns, one way is to use analytic functions, which are introduced later, to completely avoid using the GROUP BY clause. The other way is to use the collect_set function, which returns a set of objects with duplicate elements eliminated, as follows:

-- Find row count by sex and a sampled age for each sex
jdbc:hive2://> SELECT sex_age.sex,
.......> collect_set(sex_age.age)[0] AS random_age,
.......> count(*) AS row_cnt
.......> FROM employee GROUP BY sex_age.sex;
+--------------+-------------+----------+
| sex_age.sex  | random_age  | row_cnt  |
+--------------+-------------+----------+
| Female       | 27          | 2        |
| Male         | 35          | 3        |
+--------------+-------------+----------+
2 rows selected (48.15 seconds)
An aggregate function can be used with other aggregate functions in the same SELECT statement. It can also be nested with other functions, such as conditional functions. However, nested aggregate functions are not supported. See the following examples for more details:

Multiple aggregate functions called in the same SELECT statement:

jdbc:hive2://> SELECT sex_age.sex, AVG(sex_age.age) AS avg_age,
.......> count(*) AS row_cnt
.......> FROM employee GROUP BY sex_age.sex;
+--------------+---------------------+----------+
| sex_age.sex  |       avg_age       | row_cnt  |
+--------------+---------------------+----------+
| Female       | 42.0                | 2        |
| Male         | 31.666666666666668  | 3        |
+--------------+---------------------+----------+
2 rows selected (98.857 seconds)
Aggregate functions used with CASE WHEN, as follows:

jdbc:hive2://> SELECT sum(CASE WHEN sex_age.sex = 'Male'
.......> THEN sex_age.age ELSE 0 END)/
.......> count(CASE WHEN sex_age.sex = 'Male' THEN 1
.......> ELSE NULL END) AS male_age_avg FROM employee;
+---------------------+
|    male_age_avg     |
+---------------------+
| 31.666666666666668  |
+---------------------+
1 row selected (38.415 seconds)
Aggregate functions used with COALESCE and IF, as follows:

jdbc:hive2://> SELECT
.......> sum(coalesce(sex_age.age, 0)) AS age_sum,
.......> sum(if(sex_age.sex = 'Female', sex_age.age, 0))
.......> AS female_age_sum FROM employee;
+----------+-----------------+
| age_sum  | female_age_sum  |
+----------+-----------------+
| 179      | 84              |
+----------+-----------------+
1 row selected (42.137 seconds)
Nested aggregate functions are not allowed, as shown here:

jdbc:hive2://> SELECT avg(count(*)) AS row_cnt
.......> FROM employee;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10128]: Line 1:11 Not yet supported place for UDAF 'count'
(state=42000,code=10128)
Aggregate functions can also be used with the DISTINCT keyword to aggregate on unique values:

jdbc:hive2://> SELECT count(DISTINCT sex_age.sex) AS sex_uni_cnt,
.......> count(DISTINCT name) AS name_uni_cnt
.......> FROM employee;
+--------------+---------------+
| sex_uni_cnt  | name_uni_cnt  |
+--------------+---------------+
| 2            | 5             |
+--------------+---------------+
1 row selected (35.935 seconds)
Note

When we use COUNT and DISTINCT together, Hive always ignores the setting for the number of reducers used (such as mapred.reduce.tasks=20) and uses only one reducer. In this case, the single reducer becomes the bottleneck when processing big volumes of data. The workaround is to use a subquery, as follows:

-- Triggers a single reducer during the whole processing
SELECT count(DISTINCT sex_age.sex) AS sex_uni_cnt FROM employee;

-- Use a subquery to select unique values before aggregating, for better performance
SELECT count(*) AS sex_uni_cnt
FROM (SELECT DISTINCT sex_age.sex FROM employee) a;

In this case, the first stage of the query, implementing DISTINCT, can use more than one reducer. In the second stage, the mapper has less output just for the COUNT purpose, since the data is already unique after implementing DISTINCT. As a result, the reducer will not be overloaded.
We may encounter special behavior when Hive aggregates across columns containing NULL values: an entire row is ignored if any column used in the aggregated expression has a NULL value, as in the second row of the following example. To avoid this, we can use COALESCE to assign a default value when the column value is NULL. This can be done as follows:
-- Create a table t for testing
jdbc:hive2://> CREATE TABLE t AS SELECT * FROM
.......> (SELECT employee_id - 99 AS val1,
.......> (employee_id - 98) AS val2 FROM employee_hr
.......> WHERE employee_id <= 101
.......> UNION ALL
.......> SELECT null val1, 2 AS val2 FROM employee_hr
.......> WHERE employee_id = 100) a;
No rows affected (0.138 seconds)

-- Check the rows in the table created
jdbc:hive2://> SELECT * FROM t;
+---------+---------+
| t.val1  | t.val2  |
+---------+---------+
| 1       | 2       |
| NULL    | 2       |
| 2       | 3       |
+---------+---------+
3 rows selected (0.069 seconds)
-- The 2nd row (NULL, 2) is ignored when doing sum(val1 + val2)
jdbc:hive2://> SELECT sum(val1), sum(val1 + val2)
.......> FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 8    |
+------+------+
1 row selected (57.775 seconds)

jdbc:hive2://> SELECT sum(coalesce(val1, 0)),
.......> sum(coalesce(val1, 0) + val2) FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 10   |
+------+------+
1 row selected (69.967 seconds)
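The NULL-propagation behavior above follows standard SQL three-valued logic: val1 + val2 is NULL whenever val1 is NULL, and SUM simply skips NULL inputs. A minimal Python sketch of the two queries (an illustrative model of the semantics, not Hive code, with None standing in for NULL):

```python
# Rows mirror the table t above: (val1, val2), with None for NULL.
rows = [(1, 2), (None, 2), (2, 3)]

def coalesce(value, default):
    """Return value unless it is None (NULL); otherwise return default."""
    return default if value is None else value

# SQL SUM skips NULL inputs; val1 + val2 is NULL whenever val1 is NULL.
sum_val1 = sum(v1 for v1, _ in rows if v1 is not None)        # 3
sum_expr = sum(v1 + v2 for v1, v2 in rows if v1 is not None)  # 8: row (NULL, 2) ignored
sum_fixed = sum(coalesce(v1, 0) + v2 for v1, v2 in rows)      # 10: NULL treated as 0

print(sum_val1, sum_expr, sum_fixed)
```

The coalesce variant recovers the row that plain addition discards, matching the 8 versus 10 difference in the query output above.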
The hive.map.aggr property controls aggregation in the map task. The default value for this setting is false. If it is set to true, Hive will do the first-level aggregation directly in the map task for better performance, at the cost of more memory:

jdbc:hive2://> SET hive.map.aggr=true;
No rows affected (0.002 seconds)
Advanced aggregation – GROUPING SETS

Hive offers the GROUPING SETS keyword to implement multiple advanced GROUP BY operations against the same set of data. In fact, GROUPING SETS is a shorthand way of connecting several GROUP BY result sets with UNION ALL. The GROUPING SETS keyword completes all processing in a single stage of jobs, which is more efficient than GROUP BY and UNION ALL, which take multiple stages. An empty set () in the GROUPING SETS clause calculates the overall aggregation. The following are a few examples showing the equivalences of GROUPING SETS. For better understanding, we can say that the outer level of GROUPING SETS defines on what data UNION ALL is to be implemented, while the inner level defines on what data GROUP BY is to be implemented in each UNION ALL.
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS ((name, work_place[0]));
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0];

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS (name, work_place[0]);
||
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY work_place[0];

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS ((name, work_place[0]), name);
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name;

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS ((name, work_place[0]), name, work_place[0], ());
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY work_place[0]
UNION ALL
SELECT NULL AS name, NULL AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id;
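The equivalences above can be checked mechanically. The following Python sketch (illustrative data and helper names, not from the book) shows that a GROUPING SETS result is exactly the union of the per-set GROUP BY results, with None standing in for the NULL that fills columns a set does not group on:

```python
from collections import Counter

# Illustrative rows: (name, main_place); not the book's employee data.
rows = [("Lucy", "NY"), ("Lucy", "NY"), ("Will", "LA"), ("Will", "NY")]

def group_count(rows, keys):
    """GROUP BY the given column positions; ungrouped columns become None (NULL)."""
    counts = Counter(tuple(row[i] for i in keys) for row in rows)
    result = set()
    for grouped, cnt in counts.items():
        full = [None] * len(rows[0])
        for pos, i in enumerate(keys):
            full[i] = grouped[pos]
        result.add((tuple(full), cnt))
    return result

def grouping_sets_count(rows, key_sets):
    """GROUPING SETS: aggregate the same input once per grouping set."""
    result = set()
    for keys in key_sets:
        result |= group_count(rows, keys)
    return result

# GROUPING SETS ((name, place), name) == GROUP BY name, place UNION ALL GROUP BY name
left = grouping_sets_count(rows, [(0, 1), (0,)])
right = group_count(rows, (0, 1)) | group_count(rows, (0,))
print(left == right)
```

The difference in Hive is only in execution: GROUPING SETS produces this combined result in one stage instead of running each GROUP BY separately and unioning the outputs.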
However, the GROUPING SETS operation still has unresolved issues when working with columns referred to through a table or struct-type alias (see Apache JIRA HIVE-6950 at https://issues.apache.org/jira/browse/HIVE-6950). This is shown here:
jdbc:hive2://> SELECT sex_age.sex, sex_age.age,
.......> count(name) AS name_cnt
.......> FROM employee
.......> GROUP BY sex_age.sex, sex_age.age
.......> GROUPING SETS ((sex_age.sex, sex_age.age));
Error: Error while compiling statement: FAILED: ParseException line 1:131
missing ) at ',' near '<EOF>'
line 1:145 extraneous input ')' expecting EOF near '<EOF>'
(state=42000,code=40000)
Advanced aggregation – ROLLUP and CUBE

The ROLLUP statement enables a SELECT statement to calculate multiple levels of aggregation across a specified group of dimensions. The ROLLUP statement is a simple extension of the GROUP BY clause with high efficiency and minimal overhead for a query. Compared to GROUPING SETS, which creates specified levels of aggregation, ROLLUP creates n+1 levels of aggregation, where n is the number of grouping columns. First, it calculates the standard aggregate values specified in the GROUP BY clause. Then, it creates higher-level subtotals, moving from right to left through the list of grouping columns, as shown in the following example:

GROUP BY a, b, c WITH ROLLUP

This is equivalent to the following:

GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ())
The CUBE statement takes a specified set of grouping columns and creates aggregations for all of their possible combinations. If n columns are specified for CUBE, there will be 2^n combinations of aggregations returned, as shown in the following example:

GROUP BY a, b, c WITH CUBE

This is equivalent to the following:

GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ())
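The two expansions can be generated mechanically. The following Python sketch (hypothetical helper names, not Hive code) enumerates the grouping sets that ROLLUP and CUBE produce for a list of columns:

```python
from itertools import combinations

def rollup_sets(cols):
    """ROLLUP over n columns yields n+1 sets: every left-to-right prefix, down to ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """CUBE over n columns yields all 2^n subsets of the columns."""
    return [s for r in range(len(cols), -1, -1) for s in combinations(cols, r)]

print(rollup_sets(["a", "b", "c"]))
# [('a', 'b', 'c'), ('a', 'b'), ('a',), ()]
print(cube_sets(["a", "b", "c"]))
# [('a', 'b', 'c'), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('a',), ('b',), ('c',), ()]
```

Note the asymmetry: rollup_sets only drops columns from the right (subtotals respect the dimension hierarchy), while cube_sets covers every combination.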
The GROUPING__ID function works as an extension to distinguish entire rows from each other. It accepts one or more columns and returns the decimal equivalent of the BIT vector for the columns specified after GROUP BY. The returned decimal number is converted from a binary of 1s and 0s, where each bit represents whether the corresponding column is aggregated (its value is not NULL) in the row. The bit order starts counting from the column nearest to GROUP BY. In the following example, the first column is start_date:
jdbc:hive2://> SELECT GROUPING__ID,
.......> BIN(CAST(GROUPING__ID AS BIGINT)) AS bit_vector,
.......> name, start_date, count(employee_id) emp_id_cnt
.......> FROM employee_hr
.......> GROUP BY start_date, name
.......> WITH CUBE ORDER BY start_date;
+---------------+-------------+----------+-------------+-------------+
| grouping__id  | bit_vector  |   name   | start_date  | emp_id_cnt  |
+---------------+-------------+----------+-------------+-------------+
| 2             | 10          | Steven   | NULL        | 1           |
| 2             | 10          | Michael  | NULL        | 1           |
| 2             | 10          | Lucy     | NULL        | 1           |
| 0             | 0           | NULL     | NULL        | 4           |
| 2             | 10          | Will     | NULL        | 1           |
| 3             | 11          | Lucy     | 2010-01-03  | 1           |
| 1             | 1           | NULL     | 2010-01-03  | 1           |
| 1             | 1           | NULL     | 2012-11-03  | 1           |
| 3             | 11          | Steven   | 2012-11-03  | 1           |
| 1             | 1           | NULL     | 2013-10-02  | 1           |
| 3             | 11          | Will     | 2013-10-02  | 1           |
| 1             | 1           | NULL     | 2014-01-29  | 1           |
| 3             | 11          | Michael  | 2014-01-29  | 1           |
+---------------+-------------+----------+-------------+-------------+
13 rows selected (136.708 seconds)
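The bit vector in this output can be reproduced with a few lines of Python. This sketch follows the encoding shown above, where bit 0 is the column nearest to GROUP BY and a bit is 1 when the column participates in the grouping set; note that later Hive releases changed the GROUPING__ID encoding, so treat this as a model of the output shown here rather than of every Hive version:

```python
def grouping_id(groupby_cols, grouped):
    """Decimal GROUPING__ID: bit i (LSB = i-th GROUP BY column) is 1 when
    that column is part of the grouping set for the row."""
    gid = 0
    for i, col in enumerate(groupby_cols):
        if col in grouped:
            gid |= 1 << i
    return gid

cols = ["start_date", "name"]
# Matches the rows in the output above:
print(grouping_id(cols, {"start_date", "name"}))  # 3 -> bin 11
print(grouping_id(cols, {"name"}))                # 2 -> bin 10
print(grouping_id(cols, {"start_date"}))          # 1 -> bin 1
print(grouping_id(cols, set()))                   # 0 -> grand total row
```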
Aggregation condition – HAVING

Since Hive 0.7.0, HAVING has been supported for conditional filtering of GROUP BY results. By using HAVING, we can avoid using a subquery after GROUP BY. The following is an example:
jdbc:hive2://> SELECT sex_age.age FROM employee
.......> GROUP BY sex_age.age HAVING count(*) <= 1;
+--------------+
| sex_age.age  |
+--------------+
| 57           |
| 27           |
| 35           |
+--------------+
3 rows selected (74.376 seconds)
If we do not use HAVING, we can use a subquery instead, as follows:
jdbc:hive2://> SELECT a.age
.......> FROM
.......> (SELECT count(*) AS cnt, sex_age.age
.......> FROM employee GROUP BY sex_age.age
.......> ) a WHERE a.cnt <= 1;
+--------+
| a.age  |
+--------+
| 57     |
| 27     |
| 35     |
+--------+
3 rows selected (87.298 seconds)
Analytic functions

Analytic functions, available since Hive 0.11.0, are a special group of functions that scan multiple input rows to compute each output value. Analytic functions are usually used with OVER, PARTITION BY, ORDER BY, and a windowing specification. Different from the regular aggregate functions used with the GROUP BY clause, which are limited to one result value per group, analytic functions operate on windows where the input rows are ordered and grouped using flexible conditions expressed through an OVER ... PARTITION BY clause. Though analytic functions give aggregate results, they do not group the result set; they return the group value multiple times, once with each record. Analytic functions offer more flexibility and functionality than the regular GROUP BY clause and make special aggregations in Hive easier and more powerful. The syntax for an analytic function is as follows:

Function (arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
[<window_clause>])
Function (arg1, ..., argn) can be any function in the following list:

Standard aggregations: This can be COUNT(), SUM(), MIN(), MAX(), or AVG().
RANK: It ranks items in a group, such as finding the top N rows for specific conditions.
DENSE_RANK: It is similar to RANK, but leaves no gaps in the ranking sequence when there are ties. For example, if we rank a match using DENSE_RANK and have two players tied for second place, both players are in second place and the next person is ranked third. The RANK function would also rank the two players in second place, but the next person would be in fourth place.
ROW_NUMBER: It assigns a unique sequence number, starting from 1, to each row according to the partition and order specification.
CUME_DIST: It computes the number of rows whose value is smaller than or equal to the value of the current row, divided by the total number of rows.
PERCENT_RANK: It is similar to CUME_DIST, but uses rank values rather than row counts: (current rank - 1) divided by (total number of rows - 1). Therefore, it returns the percent rank of a value relative to a group of values.
NTILE: It divides an ordered dataset into the specified number of buckets and assigns the appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.
LEAD: The LEAD function, lead(value_expr[, offset[, default]]), is used to return data from the following rows. The number of rows to lead (offset) can optionally be specified; if it is not, the lead is one row by default. It returns default, or NULL when no default is specified, if the lead for the current row extends beyond the end of the window.
LAG: The LAG function, lag(value_expr[, offset[, default]]), is used to access data from previous rows. The number of rows to lag (offset) can optionally be specified; if it is not, the lag is one row by default. It returns default, or NULL when no default is specified, if the lag for the current row extends beyond the beginning of the window.
FIRST_VALUE: It returns the first result from an ordered set.
LAST_VALUE: It returns the last result from an ordered set. With the default windowing clause, the result of LAST_VALUE can be a little unexpected. This is because the default windowing clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which means the current row is always the last value. Changing the windowing clause to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING gives us the result we probably expect (see the last_value column in the following examples).
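The difference between RANK, DENSE_RANK, and ROW_NUMBER is easiest to see side by side. The following Python sketch (a hypothetical model of the semantics, not Hive code) computes all three over one ordered partition, using the dept_num 1000 salaries that appear in the examples below:

```python
def rank_functions(values):
    """Compute RANK, DENSE_RANK, and ROW_NUMBER over one ordered partition.
    values must already be sorted in the window's ORDER BY order."""
    rank, dense, rows = [], [], []
    for i, v in enumerate(values):
        rows.append(i + 1)                               # ROW_NUMBER: always unique
        if i > 0 and v == values[i - 1]:
            rank.append(rank[-1])                        # tie: same RANK ...
            dense.append(dense[-1])                      # ... and same DENSE_RANK
        else:
            rank.append(i + 1)                           # RANK leaves gaps after ties
            dense.append(dense[-1] + 1 if dense else 1)  # DENSE_RANK does not
    return rank, dense, rows

# Salaries of dept 1000 ordered ascending, with a tie at 4000.
r, d, n = rank_functions([4000, 4000, 5000, 5500, 6400])
print(r)  # [1, 1, 3, 4, 5] - RANK skips 2 after the tie
print(d)  # [1, 1, 2, 3, 4] - DENSE_RANK does not skip
print(n)  # [1, 2, 3, 4, 5]
```

These are the same rank and dense_rank columns the Hive query over employee_contract produces for dept_num 1000 later in this section.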
The [PARTITION BY <...>] statement is similar to the GROUP BY clause. It divides the rows into groups containing identical values in one or more partition-by columns. These logical groups are known as partitions, which is not the same term as used for partitioned tables. Omitting the PARTITION BY statement applies the analytic operation to all the rows in the table.

The [ORDER BY <....>] clause works like the regular ORDER BY expr [ASC|DESC] clause. It makes sure the rows produced by the PARTITION BY clause are ordered by the specification, such as in ascending or descending order. Right now, Hive supports only one ORDER BY column in this case; otherwise, it throws a semantic exception (see Apache JIRA HIVE-4662 at https://issues.apache.org/jira/browse/HIVE-4662). The workaround is to use the ROWS UNBOUNDED PRECEDING windowing clause (see the runningTotal2 column in the following examples):
Prepare the table and data for demonstration:

jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_contract
.......> (
.......> name string,
.......> dept_num int,
.......> employee_id int,
.......> salary int,
.......> type string,
.......> start_date date
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> STORED AS TEXTFILE;
No rows affected (0.282 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee_contract.txt'
.......> OVERWRITE INTO TABLE employee_contract;
No rows affected (0.48 seconds)
The regular aggregations used as analytic functions, as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
.......> COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
.......> SUM(salary) OVER (PARTITION BY dept_num
.......> ORDER BY dept_num) AS deptTotal,
.......> SUM(salary) OVER (ORDER BY dept_num)
.......> AS runningTotal1, SUM(salary)
.......> OVER (ORDER BY dept_num, name ROWS UNBOUNDED
.......> PRECEDING) AS runningTotal2
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
+---------+----------+--------+---------+-----------+---------------+---------------+
|  name   | dept_num | salary | row_cnt | deptTotal | runningTotal1 | runningTotal2 |
+---------+----------+--------+---------+-----------+---------------+---------------+
| Lucy    | 1000     | 5500   | 5       | 24900     | 24900         | 5500          |
| Michael | 1000     | 5000   | 5       | 24900     | 24900         | 10500         |
| Steven  | 1000     | 6400   | 5       | 24900     | 24900         | 16900         |
| Will    | 1000     | 4000   | 5       | 24900     | 24900         | 24900         |
| Will    | 1000     | 4000   | 5       | 24900     | 24900         | 20900         |
| Jess    | 1001     | 6000   | 3       | 17400     | 42300         | 30900         |
| Lily    | 1001     | 5000   | 3       | 17400     | 42300         | 35900         |
| Mike    | 1001     | 6400   | 3       | 17400     | 42300         | 42300         |
| Richard | 1002     | 8000   | 3       | 20500     | 62800         | 50300         |
| Wei     | 1002     | 7000   | 3       | 20500     | 62800         | 57300         |
| Yun     | 1002     | 5500   | 3       | 20500     | 62800         | 62800         |
+---------+----------+--------+---------+-----------+---------------+---------------+
11 rows selected (359.918 seconds)
Other analytic functions used as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
.......> RANK() OVER (PARTITION BY dept_num ORDER BY salary)
.......> AS rank,
.......> DENSE_RANK()
.......> OVER (PARTITION BY dept_num ORDER BY salary)
.......> AS dense_rank, ROW_NUMBER() OVER () AS row_num,
.......> ROUND((CUME_DIST() OVER (PARTITION BY dept_num
.......> ORDER BY salary)), 1) AS cume_dist,
.......> PERCENT_RANK() OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS percent_rank, NTILE(4)
.......> OVER (PARTITION BY dept_num ORDER BY salary)
.......> AS ntile
.......> FROM employee_contract ORDER BY dept_num;
+---------+----------+--------+------+------------+---------+-----------+--------------+-------+
|  name   | dept_num | salary | rank | dense_rank | row_num | cume_dist | percent_rank | ntile |
+---------+----------+--------+------+------------+---------+-----------+--------------+-------+
| Will    | 1000     | 4000   | 1    | 1          | 11      | 0.4       | 0.0          | 1     |
| Will    | 1000     | 4000   | 1    | 1          | 10      | 0.4       | 0.0          | 1     |
| Michael | 1000     | 5000   | 3    | 2          | 9       | 0.6       | 0.5          | 2     |
| Lucy    | 1000     | 5500   | 4    | 3          | 8       | 0.8       | 0.75         | 3     |
| Steven  | 1000     | 6400   | 5    | 4          | 7       | 1.0       | 1.0          | 4     |
| Lily    | 1001     | 5000   | 1    | 1          | 6       | 0.3       | 0.0          | 1     |
| Jess    | 1001     | 6000   | 2    | 2          | 5       | 0.7       | 0.5          | 2     |
| Mike    | 1001     | 6400   | 3    | 3          | 4       | 1.0       | 1.0          | 3     |
| Yun     | 1002     | 5500   | 1    | 1          | 3       | 0.3       | 0.0          | 1     |
| Wei     | 1002     | 7000   | 2    | 2          | 2       | 0.7       | 0.5          | 2     |
| Richard | 1002     | 8000   | 3    | 3          | 1       | 1.0       | 1.0          | 3     |
+---------+----------+--------+------+------------+---------+-----------+--------------+-------+
11 rows selected (367.112 seconds)
jdbc:hive2://> SELECT name, dept_num, salary,
.......> LEAD(salary, 2) OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS lead,
.......> LAG(salary, 2, 0) OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS lag,
.......> FIRST_VALUE(salary) OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS first_value,
.......> LAST_VALUE(salary) OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS last_value_default,
.......> LAST_VALUE(salary) OVER (PARTITION BY dept_num
.......> ORDER BY salary
.......> RANGE BETWEEN UNBOUNDED PRECEDING
.......> AND UNBOUNDED FOLLOWING) AS last_value
.......> FROM employee_contract ORDER BY dept_num;
+---------+----------+--------+------+------+-------------+--------------------+------------+
|  name   | dept_num | salary | lead | lag  | first_value | last_value_default | last_value |
+---------+----------+--------+------+------+-------------+--------------------+------------+
| Will    | 1000     | 4000   | 5000 | 0    | 4000        | 4000               | 6400       |
| Will    | 1000     | 4000   | 5500 | 0    | 4000        | 4000               | 6400       |
| Michael | 1000     | 5000   | 6400 | 4000 | 4000        | 5000               | 6400       |
| Lucy    | 1000     | 5500   | NULL | 4000 | 4000        | 5500               | 6400       |
| Steven  | 1000     | 6400   | NULL | 5000 | 4000        | 6400               | 6400       |
| Lily    | 1001     | 5000   | 6400 | 0    | 5000        | 5000               | 6400       |
| Jess    | 1001     | 6000   | NULL | 0    | 5000        | 6000               | 6400       |
| Mike    | 1001     | 6400   | NULL | 5000 | 5000        | 6400               | 6400       |
| Yun     | 1002     | 5500   | 8000 | 0    | 5500        | 5500               | 8000       |
| Wei     | 1002     | 7000   | NULL | 0    | 5500        | 7000               | 8000       |
| Richard | 1002     | 8000   | NULL | 5500 | 5500        | 8000               | 8000       |
+---------+----------+--------+------+------+-------------+--------------------+------------+
11 rows selected (92.572 seconds)
The [<window_clause>] clause is used to further sub-partition the result and apply the analytic function. There are two types of windows: the row type window and the range type window.

Note

According to https://issues.apache.org/jira/browse/HIVE-4797, the RANK, NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK, LEAD, LAG, and ROW_NUMBER functions do not yet support being used with a window clause.
For row type windows, the definition is in terms of row numbers before or after the current row. The general syntax of the row window clause is as follows:

ROWS BETWEEN <start_expr> AND <end_expr>

The <start_expr> can be any one of the following:

UNBOUNDED PRECEDING
CURRENT ROW
N PRECEDING or FOLLOWING

The <end_expr> can be any one of the following:

UNBOUNDED FOLLOWING
CURRENT ROW
N PRECEDING or FOLLOWING
The following are the window expressions:

BETWEEN ... AND: Use the BETWEEN ... AND clause to specify the start point and end point of the window. The first expression (before AND) defines the start point and the second expression (after AND) defines the end point. If we omit BETWEEN ... AND (such as ROWS N PRECEDING or ROWS UNBOUNDED PRECEDING), Hive considers it the start point, and the end point defaults to the current row (see the win13 column in the upcoming examples).
N PRECEDING or FOLLOWING: This indicates N rows before or after the current row.
UNBOUNDED PRECEDING: This indicates that the window starts at the first row of the partition. This is a start point specification and cannot be used as an end point specification.
UNBOUNDED FOLLOWING: This indicates that the window ends at the last row of the partition. This is an end point specification and cannot be used as a start point specification.
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: This indicates the first and last row for every row, meaning all rows in the partition (see the win12 column in the upcoming examples).
CURRENT ROW: As a start point, CURRENT ROW specifies that the window begins at the current row or value, depending on whether we have specified ROWS or RANGE (RANGE is introduced later in this chapter). In this case, the end point cannot be N PRECEDING. As an end point, CURRENT ROW specifies that the window ends at the current row or value, depending on whether we have specified ROWS or RANGE. In this case, the start point cannot be N FOLLOWING.
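The frame boundaries above can be modeled directly. This Python sketch (a hypothetical helper, using None both for UNBOUNDED and for NULL) evaluates MAX over a ROWS frame per row; with the dept_num 1001 salaries ordered by name (Jess, Lily, Mike), it reproduces the win1, win4, and win12 columns of the example that follows:

```python
def frame_max(values, i, start, end):
    """MAX over a ROWS frame for the row at index i.
    start/end are offsets from the current row: negative = PRECEDING,
    positive = FOLLOWING, None = UNBOUNDED in that direction.
    Returns None (NULL) when the frame is empty, like win4 below."""
    lo = 0 if start is None else i + start
    hi = len(values) - 1 if end is None else i + end
    lo, hi = max(lo, 0), min(hi, len(values) - 1)
    if hi < lo:
        return None  # empty frame -> NULL
    return max(values[lo:hi + 1])

# Dept 1001 salaries ordered by name (Jess, Lily, Mike), as in the book's data.
sal = [6000, 5000, 6400]
win1 = [frame_max(sal, i, -2, 0) for i in range(len(sal))]        # 2 PRECEDING AND CURRENT ROW
win4 = [frame_max(sal, i, -1, -2) for i in range(len(sal))]       # 1 PRECEDING AND 2 PRECEDING (empty)
win12 = [frame_max(sal, i, None, None) for i in range(len(sal))]  # UNBOUNDED both directions
print(win1, win4, win12)  # [6000, 6000, 6400] [None, None, None] [6400, 6400, 6400]
```

The empty win4 frame illustrates why that column is all NULL in the query output: the start of the frame lies after its end.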
The following diagram can help us understand the preceding definitions more clearly:

Window expression definition

The following examples implement the window expressions:
jdbc:hive2://> SELECT name, dept_num AS dept, salary AS sal,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 1 PRECEDING AND 2 PRECEDING) win4,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN CURRENT ROW AND CURRENT ROW) win7,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN CURRENT ROW AND 1 FOLLOWING) win8,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win9,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win10,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win11,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
.......> FOLLOWING) win12,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS 2 PRECEDING) win13
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
+---------+------+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+
|  name   | dept | sal  | win1 | win2 | win3 | win4 | win5 | win7 | win8 | win9 | win10 | win11 | win12 | win13 |
+---------+------+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+
| Lucy    | 1000 | 5500 | 5500 | 6400 | 6400 | NULL | 6400 | 5500 | 5500 | 6400 | 5500  | 5500  | 6400  | 5500  |
| Michael | 1000 | 5000 | 5500 | 6400 | 6400 | NULL | 6400 | 5000 | 6400 | 6400 | 5500  | 6400  | 6400  | 5500  |
| Steven  | 1000 | 6400 | 6400 | 6400 | 6400 | NULL | 4000 | 6400 | 6400 | 6400 | 6400  | 6400  | 6400  | 6400  |
| Will    | 1000 | 4000 | 6400 | 6400 | 4000 | NULL | NULL | 4000 | 4000 | 4000 | 6400  | 6400  | 6400  | 6400  |
| Will    | 1000 | 4000 | 6400 | 6400 | 6400 | NULL | 4000 | 4000 | 4000 | 4000 | 6400  | 6400  | 6400  | 6400  |
| Jess    | 1001 | 6000 | 6000 | 6400 | 6400 | NULL | 6400 | 6000 | 6000 | 6400 | 6000  | 6000  | 6400  | 6000  |
| Lily    | 1001 | 5000 | 6000 | 6400 | 6400 | NULL | 6400 | 5000 | 6400 | 6400 | 6000  | 6400  | 6400  | 6000  |
| Mike    | 1001 | 6400 | 6400 | 6400 | 6400 | NULL | NULL | 6400 | 6400 | 6400 | 6400  | 6400  | 6400  | 6400  |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 | NULL | 7000 | 8000 | 8000 | 8000 | 8000  | 8000  | 8000  | 8000  |
| Wei     | 1002 | 7000 | 8000 | 8000 | 8000 | NULL | 5500 | 7000 | 7000 | 7000 | 8000  | 8000  | 8000  | 8000  |
| Yun     | 1002 | 5500 | 8000 | 8000 | 7000 | NULL | NULL | 5500 | 5500 | 5500 | 8000  | 8000  | 8000  | 8000  |
+---------+------+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+
11 rows selected (168.732 seconds)
From the preceding example, we can see that the win4 column is NULL. This is because the row specified by <start_expr> must come before the row specified by <end_expr>. However, if we try to fix this by reordering the boundaries, especially when using the PRECEDING keyword, Hive reports the following exceptions; the same applies to UNBOUNDED PRECEDING. This is a current issue (https://issues.apache.org/jira/browse/HIVE-9412) with Hive windowing:
jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 2 PRECEDING AND 1 PRECEDING) win4_alter
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
Error: Error while compiling statement: FAILED: SemanticException Failed to
breakup Windowing invocations into Groups. At least 1 group must only
depend on input columns. Also check for circular dependencies.
Underlying error: Window range invalid, start boundary is greater than end
boundary: window(start=range(2 PRECEDING), end=range(1 PRECEDING))
(state=42000,code=40000)

jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) win1
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
Error: Error while compiling statement: FAILED: SemanticException End of a
WindowFrame cannot be UNBOUNDED PRECEDING (state=42000,code=40000)
In addition, windows can be defined in a separate WINDOW clause or refer to other windows, as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER w1 AS win1,
.......> MAX(salary) OVER w1 AS win2,
.......> MAX(salary) OVER w1 AS win3
.......> FROM employee_contract
.......> ORDER BY dept_num, name
.......> WINDOW
.......> w1 AS (PARTITION BY dept_num ORDER BY name
.......> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
.......> w2 AS w3,
.......> w3 AS (PARTITION BY dept_num ORDER BY name
.......> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING);
+----------+-----------+---------+-------+-------+-------+
|   name   | dept_num  | salary  | win1  | win2  | win3  |
+----------+-----------+---------+-------+-------+-------+
| Lucy     | 1000      | 5500    | 5500  | 5500  | 5500  |
| Michael  | 1000      | 5000    | 5500  | 5500  | 5500  |
| Steven   | 1000      | 6400    | 6400  | 6400  | 6400  |
| Will     | 1000      | 4000    | 6400  | 6400  | 6400  |
| Will     | 1000      | 4000    | 6400  | 6400  | 6400  |
| Jess     | 1001      | 6000    | 6000  | 6000  | 6000  |
| Lily     | 1001      | 5000    | 6000  | 6000  | 6000  |
| Mike     | 1001      | 6400    | 6400  | 6400  | 6400  |
| Richard  | 1002      | 8000    | 8000  | 8000  | 8000  |
| Wei      | 1002      | 7000    | 8000  | 8000  | 8000  |
| Yun      | 1002      | 5500    | 8000  | 8000  | 8000  |
+----------+-----------+---------+-------+-------+-------+
11 rows selected (156.902 seconds)
Compared to row type windows, which are expressed in terms of rows, range type windows are expressed in terms of values before or after the current ORDER BY column value, which must be a number or date type. For now, only one ORDER BY column is supported by range type windows.
jdbc:hive2://> SELECT name, salary, start_year,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> start_year RANGE
.......> BETWEEN 2 PRECEDING AND CURRENT ROW) win1
.......> FROM
.......> (
.......> SELECT name, salary, dept_num,
.......> YEAR(start_date) AS start_year
.......> FROM employee_contract
.......> ) a;
+----------+---------+-------------+-------+
|   name   | salary  | start_year  | win1  |
+----------+---------+-------------+-------+
| Lucy     | 5500    | 2010        | 5500  |
| Steven   | 6400    | 2012        | 6400  |
| Will     | 4000    | 2013        | 6400  |
| Will     | 4000    | 2014        | 6400  |
| Michael  | 5000    | 2014        | 6400  |
| Mike     | 6400    | 2013        | 6400  |
| Jess     | 6000    | 2014        | 6400  |
| Lily     | 5000    | 2014        | 6400  |
| Wei      | 7000    | 2010        | 7000  |
| Richard  | 8000    | 2013        | 8000  |
| Yun      | 5500    | 2014        | 8000  |
+----------+---------+-------------+-------+
11 rows selected (92.035 seconds)
Note

If we omit the windowing clause entirely, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
Sampling

When the data volume is extra large, we may need to find a subset of data to speed up data analysis. Sampling is a technique used to select and analyze a subset of data in order to identify patterns and trends. In Hive, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.
Random sampling uses the RAND() function and the LIMIT keyword to get a sample of data, as shown in the following example. The DISTRIBUTE and SORT keywords are used here to make sure the data is also randomly and efficiently distributed among the mappers and reducers. The ORDER BY RAND() statement can achieve the same purpose, but its performance is not as good:

SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT <N rows to sample>;
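The same shuffle-then-limit idea can be sketched in Python (illustrative code, not part of Hive): give every row a random sort key, order by it, and keep the first N rows:

```python
import random

def random_sample(rows, n, seed=None):
    """Analogue of DISTRIBUTE BY RAND() SORT BY RAND() LIMIT n:
    order the rows by a random key, then keep the first n."""
    rng = random.Random(seed)
    keyed = sorted(rows, key=lambda _: rng.random())  # random sort key per row
    return keyed[:n]

rows = list(range(100))
sample = random_sample(rows, 5, seed=42)
print(sample)  # 5 distinct rows drawn uniformly from the input
```

In Hive, the shuffle step is what spreads the work: DISTRIBUTE BY RAND() sends rows to random reducers, so no single reducer has to sort the whole table, unlike ORDER BY RAND().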
Bucket table sampling is a special sampling method optimized for bucket tables, as shown in the following syntax and example. The colname value specifies the column on which to sample the data. The RAND() function can also be used when sampling on entire rows. If the sample column is also the CLUSTERED BY column, the TABLESAMPLE statement will be more efficient.
-- Syntax
SELECT * FROM <Table_Name>
TABLESAMPLE(BUCKET <specified bucket number to sample> OUT OF
<total number of buckets> ON [colname|RAND()]) table_alias;

-- An example
jdbc:hive2://> SELECT name FROM employee_id_buckets
.......> TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;
+----------+
|   name   |
+----------+
| Lucy     |
| Shelley  |
| Lucy     |
| Lucy     |
| Shelley  |
| Lucy     |
| Will     |
| Shelley  |
| Michael  |
| Will     |
| Will     |
| Will     |
| Will     |
| Will     |
| Lucy     |
+----------+
15 rows selected (0.07 seconds)
Block sampling allows Hive to randomly pick up N rows of data, a percentage (n percent) of the data size, or N bytes of data. The sampling granularity is the HDFS block size. Its syntax and examples are as follows:

-- Syntax
SELECT *
FROM <Table_Name> TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;

-- ByteLengthLiteral
-- (Digit)+ ('b'|'B'|'k'|'K'|'m'|'M'|'g'|'G')
--Sample by rows
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(4 ROWS) a;
+----------+
|name|
+----------+
|Lucy|
|Shelley|
|Lucy|
|Shelley|
+----------+
4 rows selected (0.055 seconds)
--Sample by percentage of data size
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(10 PERCENT) a;
+----------+
|name|
+----------+
|Lucy|
|Shelley|
|Lucy|
+----------+
3 rows selected (0.061 seconds)
--Sample by data size
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(3M) a;
+----------+
|name|
+----------+
|Lucy|
|Shelley|
|Lucy|
|Shelley|
|Lucy|
|Shelley|
|Lucy|
|Shelley|
|Lucy|
|Will|
|Shelley|
|Lucy|
|Will|
|Shelley|
|Michael|
|Will|
|Shelley|
|Lucy|
|Will|
|Will|
|Will|
|Will|
|Will|
|Lucy|
|Shelley|
+----------+
25 rows selected (0.07 seconds)
Summary
In this chapter, we covered how to aggregate data using basic aggregation functions. Then, we introduced the advanced aggregations with GROUPING SETS, ROLLUP, and CUBE, as well as aggregation conditions using HAVING. We also covered the various analytic functions and windowing clauses. At the end of the chapter, we introduced three ways of sampling data in Hive. After going through this chapter, you should be able to do basic and advanced aggregations and data sampling in Hive.
In the next chapter, we'll talk about performance considerations in Hive.
Chapter 7. Performance Considerations
Although Hive is built to deal with big data, we still cannot ignore the importance of performance. Most of the time, a better Hive query can rely on the smart query optimizer to find the best execution strategy, as well as on the default-setting best practices from vendor packages. However, as experienced users, we should learn more about the theory and practice of performance tuning in Hive, especially when working on a performance-sensitive project or environment. In this chapter, we will start with the utilities available in Hive to find potential issues causing poor performance. Then, we introduce the best practices of performance considerations in the areas of design, file format, compression, storage, query, and job.
In this chapter, we will cover the following topics:
Performance utilities
Design optimization
Data file optimization
Job and query optimization
Performance utilities
Hive provides the EXPLAIN and ANALYZE statements that can be used as utilities to check and identify the performance of queries.
The EXPLAIN statement
Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command on a query when we have a doubt or a concern about performance. The EXPLAIN command also helps to see the difference between two or more queries written for the same purpose. The syntax for EXPLAIN is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query
The following keywords can be used:
EXTENDED: This provides additional information for the operators in the plan, such as file pathnames and an abstract syntax tree. 
DEPENDENCY: This provides a JSON-format output that contains a list of tables and partitions that the query depends on. It is available since Hive 0.10.0. 
AUTHORIZATION: This lists all entities needed to be authorized, including the input and output, to run the Hive query, and authorization failures, if any. It is available since Hive 0.14.0.
A typical query plan contains the following three sections. We will also have a look at an example later:
Abstract syntax tree (AST): Hive uses a parser generator called ANTLR (see http://www.antlr.org/) to automatically generate a syntax tree for HQL. We can usually ignore this most of the time. 
Stage dependencies: This lists all dependencies and the number of stages used to run the query. 
Stage plans: This contains important information, such as operators and sort orders, for running the job.
The following is what a typical query plan looks like. In the following example, the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE DEPENDENCIES section, both Stage-0 and Stage-1 are independent root stages. In the STAGE PLANS section, Stage-1 has one map and reduce, referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords, as well as expressions and aggregations, are listed. The Stage-0 stage does not have map and reduce. It is just a Fetch operation.
jdbc:hive2://> EXPLAIN SELECT sex_age.sex, count(*)
.......> FROM employee_partitioned
.......> WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;
+----------------------------------------------------------------------------+
|                                  Explain                                   |
+----------------------------------------------------------------------------+
| STAGE DEPENDENCIES:
|   Stage-1 is a root stage
|   Stage-0 is a root stage
|
| STAGE PLANS:
|   Stage: Stage-1
|     Map Reduce
|       Map Operator Tree:
|           TableScan
|             alias: employee_partitioned
|             Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
|             Select Operator
|               expressions: sex_age (type: struct<sex:string,age:int>)
|               outputColumnNames: sex_age
|               Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
|               Group By Operator
|                 aggregations: count()
|                 keys: sex_age.sex (type: string)
|                 mode: hash
|                 outputColumnNames: _col0, _col1
|                 Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
|                 Reduce Output Operator
|                   key expressions: _col0 (type: string)
|                   sort order: +
|                   Map-reduce partition columns: _col0 (type: string)
|                   Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
|                   value expressions: _col1 (type: bigint)
|       Reduce Operator Tree:
|         Group By Operator
|           aggregations: count(VALUE._col0)
|           keys: KEY._col0 (type: string)
|           mode: mergepartial
|           outputColumnNames: _col0, _col1
|           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
|           Select Operator
|             expressions: _col0 (type: string), _col1 (type: bigint)
|             outputColumnNames: _col0, _col1
|             Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
|             Limit
|               Number of rows: 2
|               Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
|               File Output Operator
|                 compressed: false
|                 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
|                 table:
|                     input format: org.apache.hadoop.mapred.TextInputFormat
|                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
|
|   Stage: Stage-0
|     Fetch Operator
|       limit: 2
+----------------------------------------------------------------------------+
53 rows selected (0.26 seconds)
The ANALYZE statement
Hive statistics are a collection of data that describe more details, such as the number of rows, number of files, and raw data size, of the objects in the Hive database. Statistics are metadata about Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), an optimizer that picks the query plan with the lowest cost in terms of the system resources required to complete the query.
The statistics are gathered through the ANALYZE statement, available since Hive 0.10.0, on tables, partitions, and columns, as given in the following examples:
jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;
No rows affected (27.979 seconds)

jdbc:hive2://> ANALYZE TABLE employee_partitioned
.......> PARTITION(year=2014, month=12) COMPUTE STATISTICS;
No rows affected (45.054 seconds)

jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS
.......> FOR COLUMNS employee_id;
No rows affected (41.074 seconds)
Once the statistics are built, we can check them with the DESCRIBE EXTENDED/FORMATTED statement. From the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}. The following is an example:
jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned
.......> PARTITION(year=2014, month=12);

jdbc:hive2://> DESCRIBE EXTENDED employee;
…
parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true,
transient_lastDdlTime=1417726247, numRows=4, totalSize=227,
rawDataSize=223}
jdbc:hive2://> DESCRIBE FORMATTED employee.name;
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
| col_name | data_type | min | max | num_nulls | distinct_count | avg_col_len | max_col_len |
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
| name     | string    |     |     | 0         | 5              | 5.6         | 7           |
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
+-----------+------------+-------------------+
| num_trues | num_falses | comment           |
+-----------+------------+-------------------+
|           |            | from deserializer |
+-----------+------------+-------------------+
3 rows selected (0.116 seconds)
Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting:
jdbc:hive2://> SET hive.stats.autogather=true;
Note
Hive logs
Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: the system log and the job log.
The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for the Hive log can be found:
hive.root.logger=WARN,DRFA
hive.log.dir=/tmp/${user.name}
hive.log.file=hive.log
To modify the logging status, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set it from the Hive CLI (applies only to the current user and the current session), as follows:
hive --hiveconf hive.root.logger=DEBUG,console
The job log contains Hive query information and is saved in the same place, /tmp/${user.name}, by default, as one file per Hive user session. We can override the location in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI.
Design optimization
Design optimization covers several data layout and design strategies to improve performance.
Partition tables
Hive partitioning is one of the most effective methods to improve query performance on larger tables. A query with partition filtering will only load data from the specified partitions (subdirectories), so it can execute much faster than a normal query that filters on a non-partitioning field. The selection of the partition key is always an important factor for performance. It should always be a low-cardinality attribute to avoid the overhead of too many subdirectories.
The following are some commonly used dimensions as partition keys:
Partitions by date and time: Use date and time, such as year, month, and day (even hours), as partition keys when the data is associated with the time dimension
Partitions by location: Use country, territory, state, and city as partition keys when the data is location related
Partitions by business logic: Use department, sales region, applications, customers, and so on as partition keys when the data can be separated evenly by some business logic
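As a minimal sketch of the time-dimension case, a table like the employee_partitioned table used elsewhere in this chapter could be defined as follows (the column list is illustrative):

```sql
-- Partitioned by the time dimension; year/month become subdirectories in HDFS
CREATE TABLE employee_partitioned (
  name STRING,
  employee_id INT
)
PARTITIONED BY (year INT, month INT);

-- Partition filtering: only the year=2014/month=12 subdirectory is scanned
SELECT name FROM employee_partitioned
WHERE year = 2014 AND month = 12;
```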
Bucket tables
Similar to partitioning, a bucket table organizes data into separate files in HDFS. Bucketing can speed up data sampling in Hive with sampling on buckets. Bucketing can also improve join performance if the join keys are also bucket keys, because bucketing ensures that the key is present in a certain bucket. More details are given in the Job and query optimization section in this chapter.
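A minimal sketch of a bucket table definition follows; the column list is illustrative, and two buckets are chosen to match the TABLESAMPLE(BUCKET 1 OUT OF 2 ...) example from the sampling section:

```sql
-- Bucket the table on employee_id into 2 files in HDFS
CREATE TABLE employee_id_buckets (
  name STRING,
  employee_id INT
)
CLUSTERED BY (employee_id) INTO 2 BUCKETS;

-- Make sure inserts honor the bucket definition
SET hive.enforce.bucketing = true;
```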
Index
Indexes are very common in RDBMS when we want to speed up access to a column or a set of columns. Hive supports index creation on tables/partitions since Hive 0.7.0. An index in Hive provides a key-based data view and better data access for certain operations, such as WHERE, GROUP BY, and JOIN. Using an index is a cheaper alternative to a full table scan. The command to create an index in Hive is straightforward, as follows:
jdbc:hive2://> CREATE INDEX idx_id_employee_id
.......> ON TABLE employee_id (employee_id)
.......> AS 'COMPACT'
.......> WITH DEFERRED REBUILD;
No rows affected (1.149 seconds)
In addition to the COMPACT keyword (which refers to org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler) used in the preceding example, Hive also supports BITMAP indexes since Hive 0.8.0 for columns with few distinct values, as shown in the following example:
jdbc:hive2://> CREATE INDEX idx_sex_employee_id
.......> ON TABLE employee_id (sex_age)
.......> AS 'BITMAP'
.......> WITH DEFERRED REBUILD;
No rows affected (0.251 seconds)
The WITH DEFERRED REBUILD keyword in the preceding example prevents the index from being built immediately. To build the index, we can issue ALTER…REBUILD commands, as in the following example. When data in the base table changes, the ALTER…REBUILD command must be used to bring the index up to date. This is an atomic operation, so if an index rebuild on a previously indexed table fails, the state of the index remains the same, as shown here:
jdbc:hive2://> ALTER INDEX idx_id_employee_id ON employee_id REBUILD;
No rows affected (111.413 seconds)

jdbc:hive2://> ALTER INDEX idx_sex_employee_id ON employee_id
.......> REBUILD;
No rows affected (82.23 seconds)
Once the index is built, Hive will create a new index table for each index, as follows:
jdbc:hive2://> !table
+-------------+---------------------------------------------+-------------+---------+
| TABLE_SCHEM | TABLE_NAME                                  | TABLE_TYPE  | REMARKS |
+-------------+---------------------------------------------+-------------+---------+
| default     | default__employee_id_idx_id_employee_id__   | INDEX_TABLE | NULL    |
| default     | default__employee_id_idx_sex_employee_id__  | INDEX_TABLE | NULL    |
+-------------+---------------------------------------------+-------------+---------+
The index table follows a naming convention such as default__tablename_indexname__. It contains the indexed column, the _bucketname (a typical file URI on HDFS), and _offsets (offsets for each row). This index table can then be queried, like a regular table, wherever we need the indexed columns, as shown here:
jdbc:hive2://> DESC default__employee_id_idx_id_employee_id__;
+--------------+----------------+----------+
|   col_name   |   data_type    | comment  |
+--------------+----------------+----------+
| employee_id  | int            |          |
| _bucketname  | string         |          |
| _offsets     | array<bigint>  |          |
+--------------+----------------+----------+
3 rows selected (0.135 seconds)
To drop an index, we can use the DROP INDEX index_name ON table_name statement, as follows. However, we cannot drop the index table with a DROP TABLE statement:
jdbc:hive2://> DROP INDEX idx_sex_employee_id ON employee_id;
No rows affected (0.247 seconds)
Note
Since Hive 0.13.0, Hive includes the following new features for performance optimizations:
Tez: Tez (http://tez.apache.org/) is an application framework built on YARN that can execute complex directed acyclic graphs (DAGs) for general data-processing tasks. Tez further splits map and reduce jobs into smaller tasks and combines them in a flexible and efficient way for execution. Tez is considered a flexible and powerful successor to the MapReduce framework. To configure Hive to use Tez instead of the default MapReduce, we need to overwrite the following setting:
SET hive.execution.engine=tez;
Vectorization: Vectorization optimization processes a larger batch of data at the same time rather than one row at a time, thus significantly reducing computing overhead. Each batch consists of a column vector that is usually an array of primitive types. Operations are performed on the entire column vector, which improves the instruction pipelines and cache usage. Files must be stored in the Optimized Row Columnar (ORC) format in order to use vectorization. For more on vectorization, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution. To enable vectorization, we need to do the following setting:
SET hive.vectorized.execution.enabled=true;
Data file optimization
Data file optimization covers performance improvements for the data files in terms of file format, compression, and storage.
File format
Hive supports the TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats. The three ways to specify the file format are as follows:
CREATE TABLE … STORED AS <File_Format>
ALTER TABLE … [PARTITION partition_spec] SET FILEFORMAT <File_Format>
SET hive.default.fileformat=<File_Format> --default file format for tables
Here, <File_Format> is TEXTFILE, SEQUENCEFILE, RCFILE, ORC, or PARQUET.
We can load a text file directly into a table with the TEXTFILE format. To load data into a table with another file format, we need to load the data into a TEXTFILE-format table first. Then, use INSERT OVERWRITE TABLE <target_file_format_table> SELECT * FROM <text_format_source_table> to convert and insert the data into the expected file format.
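The two-step conversion just described can be sketched as follows; the table names, columns, and input path are all illustrative:

```sql
-- Step 1: a TEXTFILE staging table that the raw text can be loaded into
CREATE TABLE employee_text (name STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/employee.txt' INTO TABLE employee_text;

-- Step 2: the target table in the desired format, populated by INSERT OVERWRITE
CREATE TABLE employee_orc (name STRING, salary INT)
STORED AS ORC;

INSERT OVERWRITE TABLE employee_orc SELECT * FROM employee_text;
```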
The file formats supported by Hive and their optimizations are as follows:
TEXTFILE: This is the default file format for Hive. Data is not compressed in a text file. It can be compressed with compression tools, such as GZip, Bzip2, and Snappy. However, these compressed files are not splittable as input during processing. As a result, Hive has to run a single, huge map job to process one big file.
SEQUENCEFILE: This is a binary storage format for key/value pairs. The benefit of a sequence file is that it is more compact than a text file and fits well with the MapReduce output format. Sequence files can be compressed at the record or block level, where the block level has a better compression ratio. To enable block-level compression, we need to do the following settings:
jdbc:hive2://> SET hive.exec.compress.output=true;
jdbc:hive2://> SET io.seqfile.compression.type=BLOCK;
Unfortunately, both text and sequence files, as row-level storage file formats, are not an optimal solution, since Hive has to read a full row even if only one column is being requested. To resolve this problem, hybrid row-columnar storage file formats, such as the RCFILE, ORC, and PARQUET implementations, were created.
RCFILE: This is short for Record Columnar File. It is a flat file consisting of binary key/value pairs that shares much similarity with a sequence file. The RCFile splits data horizontally into row groups. One or several groups are stored in an HDFS file. Then, RCFile saves the row group data in a columnar format by saving the first column across all rows, then the second column across all rows, and so on. This format is splittable and allows Hive to skip irrelevant parts of the data and get the results faster and cheaper.
ORC: This is short for Optimized Row Columnar. It is available since Hive 0.11.0. The ORC format can be considered an improved version of RCFILE. It provides a larger default block size of 256 MB (RCFILE has 4 MB and SEQUENCEFILE has 1 MB), optimized for large sequential reads on HDFS for more throughput and fewer files to reduce overload on the namenode. Different from RCFILE, which relies on the metastore to know data types, the ORC file understands the data types by using specific encoders so that it can optimize compression depending on the different types. It also stores basic statistics, such as MIN, MAX, SUM, and COUNT, on columns, as well as a lightweight index that can be used to skip blocks of rows that do not matter.
PARQUET: This is another row-columnar file format that has a similar design to that of ORC. What's more, Parquet has a wider range of support among the majority of projects in the Hadoop ecosystem, compared to ORC, which only supports Hive and Pig. Parquet leverages the design best practices of Google's Dremel (see http://research.google.com/pubs/pub36632.html) to support nested data structures. Parquet has been supported by a plugin since Hive 0.10.0 and has had native support since Hive 0.13.0.
Considering the maturity of Hive, it is suggested to use the ORC format if Hive is the main tool used in your Hadoop environment. If you use several tools in the Hadoop ecosystem, PARQUET is a better choice in terms of adaptability.
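As a small sketch of putting the recommendation into practice, an ORC table can also carry its own compression choice through the orc.compress table property (ZLIB is the ORC default; the table name and columns here are illustrative):

```sql
-- ORC storage with Snappy compression chosen via a table property
CREATE TABLE employee_orc_snappy (
  name STRING,
  salary INT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```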
Note
Hadoop Archive File (HAR) is another type of file format, used to pack HDFS files into archives. It is an option (though not a good one) for storing a large number of small files, since storing a large number of small files directly in HDFS is not very efficient. However, HAR still has some limitations that make it unpopular, such as the immutable archive process, not being splittable, and compatibility issues. For more information about HAR and archiving, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Archiving.
Compression
Compression techniques in Hive can significantly reduce the amount of data transferred between mappers and reducers through proper intermediate output compression, as well as the output data size in HDFS through output compression. As a result, the overall Hive query will have better performance. To compress intermediate files produced by Hive between multiple MapReduce jobs, we need to set the following property (false by default) in the Hive CLI or the hive-site.xml file:
jdbc:hive2://> SET hive.exec.compress.intermediate=true;
Then, we need to decide which compression codec to configure. A list of common codecs supported in Hadoop and Hive is as follows:

Compression | Codec                                      | Extension | Splittable
Deflate     | org.apache.hadoop.io.compress.DefaultCodec | .deflate  | N
GZip        | org.apache.hadoop.io.compress.GzipCodec    | .gz       | N
Bzip2       | org.apache.hadoop.io.compress.BZip2Codec   | .bz2      | Y
LZO         | com.hadoop.compression.lzo.LzopCodec       | .lzo      | N
LZ4         | org.apache.hadoop.io.compress.Lz4Codec     | .lz4      | N
Snappy      | org.apache.hadoop.io.compress.SnappyCodec  | .snappy   | N
Hadoop has a default codec (.deflate). The compression ratio for GZip is higher, as is its CPU cost. Bzip2 is splittable, but splitting wasn't supported by Hadoop until version 1.1 (see https://issues.apache.org/jira/browse/HADOOP-4012). In addition, Bzip2 is too slow for compression considering its huge CPU cost. LZO files are not natively splittable, but we can preprocess them (using com.hadoop.compression.lzo.LzoIndexer) to create an index that determines the file splits. When it comes to the balance of CPU cost and compression ratio, LZ4 and Snappy do a better job. Since the majority of codecs do not support splitting after compression, it is suggested to avoid compressing big files in HDFS.
The compression codec can be specified in mapred-site.xml, hive-site.xml, or the Hive CLI, as in the following example:
jdbc:hive2://> SET hive.intermediate.compression.codec=
.......> org.apache.hadoop.io.compress.SnappyCodec;
Intermediate compression will only save disk space for specific jobs that require multiple map and reduce stages. For further saving of disk space, the actual Hive output files can be compressed. When the hive.exec.compress.output property is set to true, Hive will use the codec configured by the mapred.output.compression.codec property to compress the data stored in HDFS, as follows. These properties can be set in hive-site.xml or in the Hive CLI:
jdbc:hive2://> SET hive.exec.compress.output=true;
jdbc:hive2://> SET mapred.output.compression.codec=
.......> org.apache.hadoop.io.compress.SnappyCodec;
Storage optimization
Data that is used or scanned frequently can be identified as hot data. Usually, query performance on hot data is critical for overall performance. Increasing the data replication factor in HDFS for hot data (see the following example) can increase the chance of the data being hit locally by Hive jobs and improve performance. However, this is a trade-off against storage:
$ hdfs dfs -setrep -R -w 4 /user/hive/warehouse/employee
Replication 4 set: /user/hive/warehouse/employee/000000_0
On the other hand, too many files or too much redundancy could exhaust the namenode's memory, especially with lots of files smaller than the HDFS block size. Hadoop itself already has some solutions to deal with the too-many-small-files issue, such as the following:
Hadoop Archive and HAR: These are toolkits to pack small files.
SequenceFile format: This is a format to compress small files into bigger files.
CombineFileInputFormat: A type of InputFormat that combines small files before map and reduce processing. It is the default InputFormat for Hive (see https://issues.apache.org/jira/browse/HIVE-2245).
HDFS federation: This makes namenodes extensible and powerful enough to manage more files.
We can also leverage other tools in the Hadoop ecosystem, if we have them installed, such as the following:
HBase has a smaller block size and a better file format to deal with small-file access issues
Flume NG can be used as a pipe to merge small files into big ones
A scheduled offline file merge program can merge small files in HDFS or before loading them to HDFS
For Hive, we can do the following configurations to merge the files of query results and avoid recreating small files:
hive.merge.mapfiles: This merges small files at the end of a map-only job. By default, it is true.
hive.merge.mapredfiles: This merges small files at the end of a MapReduce job. Set it to true, since its default is false.
hive.merge.size.per.task: This defines the size of merged files at the end of the job. The default value is 256,000,000.
hive.merge.smallfiles.avgsize: This is the threshold for triggering the file merge. The default value is 16,000,000.
When the average output file size of a job is less than the value specified by hive.merge.smallfiles.avgsize, and both hive.merge.mapfiles (for map-only jobs) and hive.merge.mapredfiles (for MapReduce jobs) are set to true, Hive will start an additional MapReduce job to merge the output files into big files.
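The merge properties just described can be set together, for example in a session, as sketched here (the sizes shown are simply the defaults mentioned above):

```sql
-- Enable small-file merging for both map-only and MapReduce query results
SET hive.merge.mapfiles = true;              -- default true
SET hive.merge.mapredfiles = true;           -- default false
SET hive.merge.size.per.task = 256000000;    -- target size of merged files
SET hive.merge.smallfiles.avgsize = 16000000; -- threshold that triggers the merge
```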
Job and query optimization
Job and query optimization covers experience and skills to improve performance in the areas of job-running mode, JVM reuse, parallel job running, and query optimizations for JOIN.
Local mode
Hadoop can run in standalone, pseudo-distributed, and fully distributed modes. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, starting distributed data processing is an overhead, since the launch time of the fully distributed mode takes longer than the job processing time. Since Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings:
jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default false
jdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5;
--default 4
A job must satisfy the following conditions to run in local mode:
The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max
The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
The total number of reduce tasks required is 1 or 0
JVM reuse
By default, Hadoop launches a new JVM for each map or reduce task and runs the map or reduce tasks in parallel. When the map or reduce task is a lightweight job running for only a few seconds, the JVM startup process can be a significant overhead. The MapReduce framework (version 1 only, not YARN) has an option to reuse JVMs by sharing a JVM to run mappers/reducers serially instead of in parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs will always run in separate JVMs. To enable reuse, we can set the maximum number of tasks for a single job that share a JVM using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1:
jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5;
We can also set the value to -1 to indicate that all the tasks for a job will run in the same JVM.
Parallel execution
Hive queries are commonly translated into a number of stages that are executed in a default sequence. These stages are not always dependent on each other. Instead, they can run in parallel to save overall job running time. We can enable this feature with the following settings:
jdbc:hive2://> SET hive.exec.parallel=true; --default false
jdbc:hive2://> SET hive.exec.parallel.thread.number=16;
--default 8, it defines the max number of stages running in parallel
Parallel execution will increase cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance.
Join optimization
We have already discussed optimization for the different types of Hive joins in Chapter 4, Data Selection and Scope. Here, we'll briefly review the key settings for join improvement.
Common join
The common join is also called the reduce side join. It is the basic join in Hive and works most of the time. For common joins, we need to make sure the big table is on the right-most side or is specified by a hint, as follows:
/*+ STREAMTABLE(stream_table_name) */
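The hint can be sketched in a full query as follows; employee_hr is an illustrative table name, and the hint tells Hive to stream table b (the bigger one) through the reducers instead of buffering it:

```sql
-- Hypothetical common join, marking the bigger table b for streaming
SELECT /*+ STREAMTABLE(b) */ a.name, b.start_date
FROM employee a
JOIN employee_hr b ON a.name = b.name;
```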
Map join
Map join is used when one of the join tables is small enough to fit in memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert to a map join automatically with the following settings:
jdbc:hive2://> SET hive.auto.convert.join=true; --default false
jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;
--default 25M
jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true;
--default false. Set to true so that the map join hint is not needed
jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000;
--The default value controls the size of a table that fits in memory
Once auto convert is enabled, Hive will automatically check whether the smaller table's file size is bigger than the value specified by hive.mapjoin.smalltable.filesize; if so, Hive will convert the join to a common join. If the file size is smaller than this threshold, it will try to convert the common join into a map join. Once auto convert join is enabled, there is no need to provide map join hints in the query.
Bucket map join
Bucket map join is a special type of map join applied on bucket tables. To enable bucket map join, we need to enable the following settings:
jdbc:hive2://> SET hive.auto.convert.join=true; --default false
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false
In a bucket map join, all the join tables must be bucket tables and join on the bucket columns. In addition, the number of buckets in the bigger tables must be a multiple of the number of buckets in the smaller tables.
Sort merge bucket (SMB) join
SMB is a join performed on bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB:
jdbc:hive2://> SET hive.input.format=
.......> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
Sort merge bucket map (SMBM) join
SMBM join is a special bucket join that triggers a map-side join only. It can avoid caching all rows in memory as a map join does. To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need the following settings:
jdbc:hive2://> SET hive.auto.convert.join=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=
org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;
Skew join
When working with data that has a highly uneven distribution, data skew can happen in such a way that a small number of compute nodes must handle the bulk of the computation. The following settings inform Hive to optimize properly if data skew happens:
jdbc:hive2://> SET hive.optimize.skewjoin=true;
--If there is data skew in the join, set it to true. Default is false.
jdbc:hive2://> SET hive.skewjoin.key=100000;
--This is the default value. If the number of rows for a key is bigger
--than this, the new keys will be sent to the other unused reducers.
Note
Skewed data could happen on the GROUP BY data too. To optimize for it, we need the following setting to enable skew-data optimization in the GROUP BY result:
SET hive.groupby.skewindata=true;
Once configured, Hive will first trigger an additional MapReduce job whose map output will be randomly distributed to the reducers to avoid data skew.
For more information about Hive join optimization, please refer to the Apache Hive wiki, available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.
Summary
In this chapter, we first covered how to identify performance bottlenecks using the EXPLAIN and ANALYZE statements. Then, we spoke about design optimization for performance when using tables, partitions, and indexes. We also covered data file optimization, including file format, compression, and storage. At the end of this chapter, we discussed job and query optimization in Hive. After going through this chapter, you should be able to do performance troubleshooting and tuning in Hive.
In the next chapter, we'll talk about function extensions for Hive.
Chapter 8. Extensibility Considerations
Although Hive has many built-in functions, users sometimes need power beyond that provided by built-in functions. For these instances, Hive offers the following three main areas where its functionality can be extended:
User-defined function (UDF): This provides a way to extend functionality with an external function (mainly written in Java) that can be evaluated in HQL
Streaming: This plugs users' own customized mapper and reducer programs into the data streaming
SerDe: This stands for serializers and deserializers and provides a way to serialize or deserialize a custom file format with files stored on HDFS
In this chapter, we'll talk about each of them in more detail.
User-defined functions
Hive defines the following three types of UDF:
UDFs: These are regular user-defined functions that operate row-wise and output one result for one row, such as most built-in mathematical and string functions.
UDAFs: These are user-defined aggregating functions that operate row-wise or group-wise and output one row, or one row for each group, as a result, such as the MAX and COUNT built-in functions.
UDTFs: These are user-defined table-generating functions that also operate row-wise, but they produce multiple rows/tables as a result, such as the EXPLODE function. A UDTF can be used either after SELECT or after the LATERAL VIEW statement.
Note
Since Hive is implemented in Java, UDFs should be written in Java as well. Since Java supports running code in other languages through the javax.script API (see http://docs.oracle.com/javase/6/docs/api/javax/script/package-summary.html), UDFs can be written in languages other than Java. In this book, we only focus on Java UDFs.
We'll start by looking at the Java code template for each kind of function in more detail.
The UDF code template
The code template for a regular UDF is as follows:
package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

// Below are optional, or add more when needed
import org.apache.hadoop.io.Text;
import org.apache.commons.lang.StringUtils;

@Description(
  name = "udf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
@UDFType(deterministic = true, stateful = false)
public class udf_name extends UDF {
  public String evaluate() {
    /*
     * Do something here
     */
    return "return the udf result";
  }

  // overloading is supported
  public String evaluate(<Type_arg1> arg1, ..., <Type_argN> argN) {
    /*
     * Do something here
     */
    return "return the udf result";
  }
}
In the preceding template, the package definition and imports should be self-explanatory. We can import whatever is needed besides the top three mandatory libraries. The @Description annotation is a useful Hive-specific annotation that provides usage information for the UDF in the Hive console. The information defined in the value property will be shown by the HQL DESCRIBE FUNCTION command. The information defined in the extended property will be shown by the HQL DESCRIBE FUNCTION EXTENDED command. The @UDFType annotation tells Hive what behavior to expect from the function. A deterministic UDF (deterministic = true) is a function that always gives the same result when passed the same arguments, such as LENGTH(string input), MAX(), and so on. On the other hand, a non-deterministic UDF (deterministic = false) can return a different result for the same set of arguments, for example, UNIX_TIMESTAMP(), which returns the current timestamp in the default time zone. The stateful (stateful = true) property allows functions to keep some static variables available across rows, such as ROW_NUMBER(), which assigns sequential numbers to all rows in a table.

All UDFs extend the Hive UDF class, so the UDF subclass must implement the evaluate method, which is called by Hive. The evaluate method can be overloaded for different purposes. In this method, we can implement whatever logic and exception handling the function's design calls for, using the Java Hadoop library and the Hadoop data types for MapReduce data serialization, such as Text, DoubleWritable, IntWritable, and so on.
The UDAF code template

In this section, we introduce the UDAF code template built by extending the UDAF class. The code template is as follows:

package com.packtpub.hive.essentials.hiveudaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

@Description(
  name = "udaf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
@UDFType(deterministic = false, stateful = true)
public final class udaf_name extends UDAF {

  /**
   * The internal state of an aggregation function.
   *
   * Note that this is only needed if the internal state
   * cannot be represented by a primitive.
   *
   * The internal state can contain fields with types like
   * ArrayList<String> and HashMap<String, Double> if needed.
   */
  public static class UDAFState {
    private <Type_state1> state1;
    private <Type_stateN> stateN;
  }

  /**
   * The actual class for doing the aggregation. Hive will
   * automatically look for all internal classes of the UDAF
   * that implement UDAFEvaluator.
   */
  public static class UDAFExampleAvgEvaluator implements UDAFEvaluator {

    UDAFState state;

    public UDAFExampleAvgEvaluator() {
      super();
      state = new UDAFState();
      init();
    }

    /**
     * Reset the state of the aggregation.
     */
    public void init() {
      /*
       * Examples for initializing state.
       */
      state.state1 = 0;
      state.stateN = 0;
    }

    /**
     * Iterate through one row of original data.
     *
     * The number and type of arguments need to be the same as we
     * call this UDAF from the Hive command line.
     *
     * This function should always return true.
     */
    public boolean iterate(<Type_arg1> arg1, ..., <Type_argN> argN) {
      /*
       * Add logic here for how to do aggregation if there is
       * a new value to be aggregated.
       */
      return true;
    }

    /**
     * Called on the mapper side on different data nodes.
     * Terminate a partial aggregation and return the state.
     * If the state is a primitive, just return primitive Java
     * classes like Integer or String.
     */
    public UDAFState terminatePartial() {
      /*
       * Check and return a partial result as expected.
       */
      return state;
    }

    /**
     * Merge with a partial aggregation.
     *
     * This function should always have a single argument,
     * which has the same type as the return value of
     * terminatePartial().
     */
    public boolean merge(UDAFState o) {
      /*
       * Define how to merge the results calculated
       * from all data nodes.
       */
      return true;
    }

    /**
     * Terminates the aggregation and returns the final result.
     */
    public long terminate() {
      /*
       * Check and return the final result as expected.
       */
      return state.stateN;
    }
  }
}
A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF containing one or more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator. Make sure that the inner class that implements UDAFEvaluator is defined as public. Otherwise, Hive won't be able to use reflection to determine the UDAFEvaluator implementation. We should also implement the five required functions (init, iterate, terminatePartial, merge, and terminate) already described in the code comments.

Note
Both UDF and UDAF can also be implemented by extending the GenericUDF and GenericUDAFEvaluator classes to avoid using Java reflection for better performance. These generic functions are actually what Hive's built-in UDF implementations extend internally. Generic functions support complex data types, such as MAP, ARRAY, and STRUCT, as arguments, but the UDF and UDAF classes do not. For more information about GenericUDAF, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy.
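To make the evaluator lifecycle concrete, here is a minimal plain-Python sketch (an illustration only, not Hive code) of how the init, iterate, terminatePartial, merge, and terminate calls cooperate to compute an average across two simulated mappers and one reducer:

```python
# A plain-Python sketch of the UDAF evaluator lifecycle for AVG.
# This illustrates the call protocol only; it is not runnable Hive code.

class AvgEvaluator:
    def __init__(self):
        self.init()

    def init(self):
        # Reset the aggregation state (running sum and row count).
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        # Called once per input row on the mapper side.
        if value is not None:
            self.total += value
            self.count += 1
        return True

    def terminate_partial(self):
        # Mapper side: emit the partial state for shipping to the reducer.
        return (self.total, self.count)

    def merge(self, partial):
        # Reducer side: fold in another node's partial state.
        self.total += partial[0]
        self.count += partial[1]
        return True

    def terminate(self):
        # Produce the final aggregated result.
        return self.total / self.count if self.count else None

# Simulate two mappers and one reducer.
m1, m2 = AvgEvaluator(), AvgEvaluator()
for v in [1, 2, 3]:
    m1.iterate(v)
for v in [4, 5]:
    m2.iterate(v)

reducer = AvgEvaluator()
reducer.merge(m1.terminate_partial())
reducer.merge(m2.terminate_partial())
print(reducer.terminate())  # 3.0
```

The same division of labor applies in the Java template above: iterate runs on the mappers, merge folds the partial states on the reducers, and terminate produces the final value.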
The UDTF code template

To implement a UDTF, there is only one way: extending org.apache.hadoop.hive.ql.exec.GenericUDTF. There is no plain UDTF class. We need to implement three methods: initialize, process, and close. The UDTF will call the initialize method, which returns information about the function output, such as the data types, the number of outputs, and so on. Then, the process method is called to apply the core function logic to the arguments and forward the result. At the end, the close method will do proper cleanup, if needed. The code template for a UDTF is as follows:

package com.packtpub.hive.essentials.hiveudtf;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

@Description(
  name = "udtf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
public class udtf_name extends GenericUDTF {

  private PrimitiveObjectInspector stringOI = null;

  /**
   * This method will be called exactly once per instance.
   * It performs any custom initialization logic we need.
   * It is also responsible for verifying the input types and
   * specifying the output types.
   */
  @Override
  public StructObjectInspector initialize(ObjectInspector[] args)
      throws UDFArgumentException {

    // Check the number of arguments.
    if (args.length != 1) {
      throw new UDFArgumentException(
          "The UDTF should take exactly one argument");
    }

    /*
     * Check that the input ObjectInspector[] array contains a
     * single PrimitiveObjectInspector of the primitive type,
     * such as String.
     */
    if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
        || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() !=
        PrimitiveObjectInspector.PrimitiveCategory.STRING) {
      throw new UDFArgumentException(
          "The UDTF should take a string as a parameter");
    }

    stringOI = (PrimitiveObjectInspector) args[0];

    /*
     * Define the expected output for this function, including
     * each alias and the types for the aliases.
     */
    List<String> fieldNames = new ArrayList<String>(2);
    List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
    fieldNames.add("alias1");
    fieldNames.add("alias2");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);

    // Set up the output schema.
    return ObjectInspectorFactory.getStandardStructObjectInspector(
        fieldNames, fieldOIs);
  }

  /**
   * This method is called once per input row and generates
   * output. The "forward" method is used (instead of
   * "return") in order to specify the output from the function.
   */
  @Override
  public void process(Object[] record) throws HiveException {
    /*
     * We may need to convert the object to a primitive type
     * before implementing customized logic.
     */
    final String recStr = (String)
        stringOI.getPrimitiveJavaObject(record[0]);
    // Emit newly created structs after applying customized logic.
    forward(new Object[] {recStr, Integer.valueOf(1)});
  }

  /**
   * This method is for any cleanup that is necessary before
   * returning from the UDTF. Since the output stream has
   * already been closed at this point, this method cannot
   * emit more rows.
   */
  @Override
  public void close() throws HiveException {
    // Do nothing.
  }
}
Development and deployment

We'll go through the whole development and deployment process using an example. Let's create a Hive function called toUpper, which converts a string to uppercase, using the following steps:

1. Download and install a Java IDE, such as Eclipse, from http://www.eclipse.org/downloads/packages/eclipse-ide-java-developers/lunasr1.
2. Start the IDE and create a Java project.
3. Right-click on the project and choose the Build Path | Configure Build Path | Add External Jars option. It will open a new window. Navigate to the directory containing the Hive and Hadoop libraries. Then, select and add all the JAR files that need to be imported. We can also resolve library dependencies automatically by using Maven (see http://maven.apache.org/) and a proper pom.xml file. How to configure a library repository in pom.xml files is usually well described in the Hadoop vendor package or in the Apache Hive and Hadoop help documents.
4. In the IDE, create the ToUpper.java file as follows, according to the UDF template mentioned previously:

package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class ToUpper extends UDF {
  public Text evaluate(Text input) {
    if (input == null) return null;
    return new Text(input.toString().toUpperCase());
  }
}

5. Now, export this project as a JAR file (or build it with Maven) named toupper.jar.
6. Copy this JAR file to a directory, such as /home/dayongd/hive/lib/, on a node of the Hive cluster.
7. Add the JAR to the Hive environment using one of the following options (option 3 or 4 is recommended):

Option 1: Run ADD JAR /home/dayongd/hive/lib/toupper.jar in the Hive CLI. This is only valid for the current session, and does not work for ODBC connections.
Option 2: Add ADD JAR /home/dayongd/hive/lib/toupper.jar to /home/$USER/.hiverc (we can create the file if it is not there). In this case, the file needs to be deployed to every node from where we might launch the Hive shell. This is also only valid for the current session, and does not work for ODBC connections.
Option 3: Add the following configuration to the hive-site.xml file:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///home/dayongd/hive/lib/toupper.jar</value>
</property>

Option 4: Copy the JAR file to the /${HIVE_HOME}/auxlib/ folder (create it if it does not exist).

8. Create the function. We can create a temporary function that is only valid in the current Hive session as follows:

CREATE TEMPORARY FUNCTION toUpper AS
'com.packtpub.hive.essentials.hiveudf.ToUpper';

Note
Since Hive 0.13.0, we can use one command to add the JAR and create a permanent function, which is registered to the metastore and can be referenced in a query without creating a temporary function in each session:

CREATE FUNCTION toUpper AS
'com.packtpub.hive.essentials.hiveudf.ToUpper' USING JAR
'hdfs:///path/to/jar';

9. Verify the function:

SHOW FUNCTIONS ToUpper;
DESCRIBE FUNCTION ToUpper;
DESCRIBE FUNCTION EXTENDED ToUpper;

10. Use the UDF in HQL:

SELECT toUpper(name) FROM employee LIMIT 1000;

11. Drop the function when needed:

DROP TEMPORARY FUNCTION IF EXISTS toUpper;
Streaming

Hive can also leverage the streaming feature in Hadoop to transform data in an alternative way. The streaming API opens an I/O pipe to an external process (script). Then, the process reads data from the standard input and writes the results out through the standard output. In Hive, we can use the TRANSFORM clause in HQL directly and embed mapper and reducer scripts written as commands, shell scripts, Java, or other programming languages. Although streaming brings overhead by using serialization/deserialization between processes, it is a simpler coding mode for developers, especially non-Java developers. The syntax of the TRANSFORM clause is as follows:

FROM (
  FROM src
  SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'map_user_script'
  (AS colName (',' colName)*)?
  (outRowFormat)? (outRecordReader)?
  (CLUSTER BY? | DISTRIBUTE BY? SORT BY?) src_alias
)
SELECT TRANSFORM '(' expression (',' expression)* ')'
(inRowFormat)?
USING 'reduce_user_script'
(AS colName (',' colName)*)?
(outRowFormat)? (outRecordReader)?
By default, the INPUT values for the user script are the following:

Columns transformed to STRING values
Delimited by a tab
NULL values converted to the literal string \N (which differentiates NULL values from empty strings)

By default, the OUTPUT values of the user script are the following:

Treated as tab-separated STRING columns
\N will be reinterpreted as NULL
The resulting STRING column will be cast to the data type specified in the table declaration
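As an illustration of these defaults (plain Python, not part of Hive), a hypothetical row codec for a streaming script could look like this:

```python
# A sketch (not Hive code) of the default TRANSFORM I/O conventions:
# tab-delimited STRING columns, with NULL encoded as the literal \N.

def decode_row(line):
    # Parse one tab-delimited input line; the literal \N becomes None.
    return [None if col == r"\N" else col
            for col in line.rstrip("\n").split("\t")]

def encode_row(cols):
    # Serialize one output row; None is written back as \N.
    return "\t".join(r"\N" if col is None else str(col) for col in cols)

row = decode_row("Steven\t\\N\n")
print(row)  # ['Steven', None]
```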
These defaults can be overridden with ROW FORMAT. An example of Hive streaming using the Python script upper.py is as follows:

#!/usr/bin/env python
'''
This is a script to uppercase its input
'''
import sys

def main():
    try:
        for line in sys.stdin:
            n = line.strip()
            print(n.upper())
    except:
        return None

if __name__ == "__main__":
    main()
Test the script as follows:

$ echo "Will" | python upper.py
WILL

Call the script in the Hive CLI from HQL:

jdbc:hive2://> ADD FILE /home/dayongd/Downloads/upper.py;
jdbc:hive2://> SELECT TRANSFORM (name, work_place[0])
.......> USING 'python upper.py' AS (CAP_NAME, CAP_PLACE)
.......> FROM employee;
+-----------+------------+
| cap_name  | cap_place  |
+-----------+------------+
| MICHAEL   | MONTREAL   |
| WILL      | MONTREAL   |
| SHELLEY   | NEW YORK   |
| LUCY      | VANCOUVER  |
| STEVEN    | NULL       |
+-----------+------------+
5 rows selected (30.101 seconds)

Note
The TRANSFORM command has not been allowed when SQL standard-based authorization is configured, since Hive 0.13.0.
SerDe

SerDe stands for Serializer and Deserializer. It is the technology that Hive uses to process records and map them to column data types in Hive tables. To explain the scenario of using SerDe, we need to understand how Hive reads and writes data.

The process to read data is as follows:

1. Data is read from HDFS.
2. Data is processed by the INPUTFORMAT implementation, which defines the input data splits and key/value records. In Hive, we can use CREATE TABLE … STORED AS <FILE_FORMAT> (see Chapter 7, Performance Considerations, for the available file formats) to specify which INPUTFORMAT it reads from.
3. The Java Deserializer class defined in SerDe is called to format the data into a record that maps to the columns and data types in a table.

For an example of reading data, we can use JSONSerDe to read TEXTFILE format data from HDFS and translate each row of JSON attributes and values into rows in Hive tables with the correct schema.

The process to write data is as follows:

1. Data to be written (such as through an INSERT statement) is translated by the Serializer class defined in SerDe into the format that the OUTPUTFORMAT class can read.
2. Data is processed by the OUTPUTFORMAT implementation, which creates the RecordWriter object. Similar to the INPUTFORMAT implementation, the OUTPUTFORMAT implementation is specified in the same way for the table to which it writes the data.
3. The data is written to the table (data saved in HDFS).

For an example of writing data, we can write a row-column of data to Hive tables using JSONSerDe, which translates the data to a JSON text string saved to HDFS.
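The two directions can be sketched in plain Python using the JSON example above (this is an analogy, not Hive's actual SerDe API; the column names are hypothetical):

```python
import json

# An analogy to the JSONSerDe example above (not Hive's SerDe API):
# each line of a TEXTFILE is one record; the "deserializer" maps the raw
# text to column values, and the "serializer" does the reverse.

COLUMNS = ["name", "age"]  # hypothetical table schema

def deserialize(raw_line):
    # Read path: raw HDFS record -> row of column values.
    obj = json.loads(raw_line)
    return [obj.get(col) for col in COLUMNS]

def serialize(row):
    # Write path: row of column values -> raw record for the OUTPUTFORMAT.
    return json.dumps(dict(zip(COLUMNS, row)))

row = deserialize('{"name": "Will", "age": 28}')
print(row)  # ['Will', 28]
```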
Recent Hive versions use the org.apache.hadoop.hive.serde2 library; org.apache.hadoop.hive.serde is the deprecated library. A list of commonly used SerDes in Hive is as follows:

LazySimpleSerDe: The default built-in SerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) that is used with the TEXTFILE format. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_lz
.......> STORED AS TEXTFILE AS
.......> SELECT name from employee;
No rows affected (32.665 seconds)
ColumnarSerDe: This is the built-in SerDe used with the RCFILE format. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_cs
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
.......> STORED AS RCFile AS
.......> SELECT name from employee;
No rows affected (27.187 seconds)
RegexSerDe: This is the built-in Java regular expression SerDe used to parse text files. It can be used as follows:

-- Parse comma-separated fields
jdbc:hive2://> CREATE TABLE test_serde_rex(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
.......> WITH SERDEPROPERTIES(
.......> 'input.regex'='([^,]*),([^,]*),([^,]*)',
.......> 'output.format.string'='%1$s %2$s %3$s'
.......> )
.......> STORED AS TEXTFILE;
No rows affected (0.266 seconds)
HBaseSerDe: This is the built-in SerDe that enables Hive to integrate with HBase. We can store Hive tables in HBase by leveraging this SerDe. Make sure to have HBase installed before running the following query:

jdbc:hive2://> CREATE TABLE test_serde_hb(
.......> id string,
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.hbase.HBaseSerDe'
.......> STORED BY
.......> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
.......> WITH SERDEPROPERTIES(
.......> "hbase.columns.mapping"=
.......> ":key,info:name,info:sex,info:age"
.......> )
.......> TBLPROPERTIES("hbase.table.name"="test_serde");
No rows affected (0.387 seconds)
AvroSerDe: This is the built-in SerDe that enables reading and writing Avro (see http://avro.apache.org/) data in Hive tables. Avro is a remote procedure call and data serialization framework. Since Hive 0.14.0, Avro-backed tables can simply be created by using the CREATE TABLE … STORED AS AVRO statement, as follows:

jdbc:hive2://> CREATE TABLE test_serde_avro(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
.......> STORED AS INPUTFORMAT
.......> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
.......> OUTPUTFORMAT
.......> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
.......> ;
No rows affected (0.31 seconds)
ParquetHiveSerDe: This is the built-in SerDe (parquet.hive.serde.ParquetHiveSerDe) that enables reading and writing the Parquet data format since Hive 0.13.0. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_parquet
.......> STORED AS PARQUET AS
.......> SELECT name from employee;
No rows affected (34.079 seconds)
OpenCSVSerDe: This is the SerDe for reading and writing CSV data. It has been a built-in SerDe since Hive 0.14.0. We can also install the implementation from other open source libraries, such as https://github.com/ogrodnek/csv-serde. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_csv(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
.......> STORED AS TEXTFILE;
JSONSerDe: This is a third-party SerDe for reading and writing JSON data records with Hive. Make sure to install it (from https://github.com/rcongiu/Hive-JSON-Serde) before running the following query:

jdbc:hive2://> CREATE TABLE test_serde_js(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
.......> STORED AS TEXTFILE;
No rows affected (0.245 seconds)
Hive also allows users to define a custom SerDe if none of these works for their data format. For more information about custom SerDes, please refer to the Apache wiki at https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HowtoWriteYourOwnSerDe.
Summary

In this chapter, we introduced the three main areas in which Hive's functionality can be extended. We also covered the three kinds of user-defined functions in Hive, as well as the coding templates and deployment steps to guide your coding and deployment practice. Then, we talked about streaming in Hive as a way to plug in your own code, which does not have to be Java code. At the end of this chapter, we discussed the SerDes available in Hive to parse different formats of data files when reading or writing data. After going through this chapter, we should be able to write basic UDFs, plug code into streaming, and use the available SerDes in Hive.

In the next chapter, we'll talk about security considerations for Hive.
Chapter 9. Security Considerations

In most open source software, security is one of the most important areas, but it is always addressed at a later stage. As the main SQL-like interface for data in Hadoop, Hive must ensure that data is securely protected and accessed. For this reason, security in Hive is now considered an integral and important part of the Hadoop ecosystem. The earlier versions of Hive mainly relied on HDFS for security. The security of Hive gradually matured after HiveServer2 was released as an important milestone of the Hive server.

This chapter will discuss Hive security in the following areas:

Authentication
Authorization
Encryption
Authentication

Authentication is the process of verifying the identity of a user by obtaining the user's credentials. Hive has offered authentication since HiveServer2. With the previous HiveServer, if we could access the host/port over the network, we could access the data. In that case, the Hive Metastore server can be used to authenticate thrift clients using Kerberos. As mentioned in Chapter 2, Setting Up the Hive Environment, it is strongly recommended to upgrade the Hive server to HiveServer2 in terms of security and reliability. In this section, we will briefly talk about authentication configurations in both the Metastore server and HiveServer2.

Note
Kerberos

Kerberos is a network authentication protocol developed by MIT as part of Project Athena. It uses time-sensitive tickets that are generated using symmetric key cryptography to securely authenticate a user in an unsecured network environment. The name Kerberos is derived from Greek mythology, where Kerberos was the three-headed dog that guarded the gates of Hades. The three-headed part refers to the three parties involved in the Kerberos authentication process: the client, the server, and the Key Distribution Center (KDC). All clients and servers registered to a KDC are known as a realm, which is typically the domain's DNS name in all caps. For more information, please refer to the MIT Kerberos website at http://web.mit.edu/kerberos/.
Metastore server authentication

To force clients to authenticate with the Hive Metastore server using Kerberos, we can set the following properties in the hive-site.xml file:

Enable the Simple Authentication and Security Layer (SASL) framework to enforce client Kerberos authentication, as follows:

<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
  <description>If true, the metastore thrift interface will be secured
  with the SASL framework. Clients must authenticate with Kerberos.
  </description>
</property>

Specify the Kerberos keytab that is generated. Override the following example if we want to keep the file in another place. Make sure the file access permissions are set to 400, implying read permission for the owner only, to avoid the owner's identity being stolen by others:

<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>/etc/hive/conf/hive.keytab</value>
  <description>The sample path to the Kerberos keytab file containing
  the metastore thrift server's service principal.</description>
</property>

Specify the Kerberos principal pattern string. The special string _HOST will be replaced automatically with the correct hostname. The YOUR-REALM.COM value should be replaced by the actual realm name:

<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
  <description>The service principal for the metastore thrift server.
  </description>
</property>
HiveServer2 authentication

HiveServer2 supports the following authentication modes. To configure HiveServer2 to use one of these modes, we can set the proper properties in hive-site.xml as follows:

None authentication: None authentication is the default setting. "None" here means Hive allows anonymous access, as shown in the following setting:

<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
</property>

Kerberos authentication: If Kerberos authentication is used, authentication is supported between the thrift client and HiveServer2, and between HiveServer2 and secure HDFS. To enable Kerberos authentication for HiveServer2, we can set the following properties, overriding the keytab path (if we want to keep the file in another place) as well as changing YOUR-REALM.COM to the actual realm name:

<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
</property>

Once Kerberos is enabled, the JDBC client (such as Beeline) must include the principal parameter in the JDBC connection string, such as the following:

jdbc:hive2://HiveServer2HostName:10000/default;principal=hive/HiveServe
LDAP authentication: To configure HiveServer2 to use user and password validation backed by LDAP (see http://tools.ietf.org/html/rfc4511), we can set the following properties:

<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>LDAP_URL, such as ldap://[email protected]</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.Domain</name>
  <value>Your Domain Name</value>
</property>

To configure it with OpenLDAP, we can add the baseDN setting instead of the Domain property, as follows:

<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>LDAP_BaseDN, such as ou=people,dc=packtpub,dc=com</value>
</property>
Pluggable custom authentication: This provides a custom authentication provider for HiveServer2. To enable it, configure the settings as follows:

<property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
</property>
<property>
  <name>hive.server2.custom.authentication.class</name>
  <value>pluggable-auth-class-name</value>
  <description>Custom authentication class name, such as
  com.packtpub.hive.essentials.hiveudf.customAuthenticator
  </description>
</property>

Note
Pluggable authentication with a customized class did not work until the bug (see https://issues.apache.org/jira/browse/HIVE-4778) was fixed in Hive 0.13.0.
The following is a sample of a customized class that implements the org.apache.hive.service.auth.PasswdAuthenticationProvider interface. The overridden Authenticate method has the core logic of how to authenticate a username and password. Make sure to copy the compiled JAR file to $HIVE_HOME/lib/ so that the preceding settings can work.

customAuthenticator.java

package com.packtpub.hive.essentials.hiveudf;

import java.util.Hashtable;
import javax.security.sasl.AuthenticationException;
import org.apache.hive.service.auth.PasswdAuthenticationProvider;

/*
 * The customized class for HiveServer2 authentication
 */
public class customAuthenticator implements PasswdAuthenticationProvider {

  Hashtable<String, String> authHashTable = null;

  public customAuthenticator() {
    authHashTable = new Hashtable<String, String>();
    authHashTable.put("user1", "passwd1");
    authHashTable.put("user2", "passwd2");
  }

  @Override
  public void Authenticate(String user, String password)
      throws AuthenticationException {
    String storedPasswd = authHashTable.get(user);
    if (storedPasswd != null && storedPasswd.equals(password))
      return;
    throw new AuthenticationException(
        "customAuthenticatorException: Invalid user");
  }
}
Pluggable Authentication Modules (PAM) authentication: Since Hive 0.13.0, Hive supports PAM authentication, which provides the benefit of plugging existing authentication mechanisms into Hive. Configure the following settings to enable PAM authentication. For more information about how to install PAM, please refer to the Setting Up HiveServer2 article in the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PluggableAuthenticationModules(PAM).

<property>
  <name>hive.server2.authentication</name>
  <value>PAM</value>
</property>
<property>
  <name>hive.server2.authentication.pam.services</name>
  <value>pluggable-auth-class-name</value>
  <description>Set this to a list of comma-separated PAM services that
  will be used. Note that a file with the same name as the PAM service
  must exist in /etc/pam.d.</description>
</property>
Authorization

Authorization in Hive is used to verify whether a user has permission to perform a certain action, such as creating, reading, or writing data or metadata. Hive provides three authorization modes: legacy mode, storage-based mode, and SQL standard-based mode.
Legacy mode

This is the default authorization mode in Hive, providing column- and row-level authorization through HQL statements. However, it is not a completely secure authorization mode and has a couple of limitations. It can mainly be used to prevent good users from accidentally doing bad things rather than to prevent malicious operations. In order to enable the legacy authorization mode, we need to set the following properties in hive-site.xml:

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enables or disables the hive client authorization
  </description>
</property>
<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
  <description>The privileges automatically granted to the owner whenever
  a table gets created. An example like "select,drop" will grant select
  and drop privileges to the owner of the table.
  </description>
</property>

Since this is not a secure authorization mode, we will not discuss more details here. For more on HQL support in the legacy authorization mode, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Hive+Default+Authorization+-+Legacy+Mode.
Storage-based mode

The storage-based authorization mode (available since Hive 0.10.0) relies on the authorization provided by the storage layer, HDFS, which provides both POSIX and ACL permissions (the latter available since Hive 0.14.0; refer to https://issues.apache.org/jira/browse/HIVE-7583). Storage-based authorization is enabled in the Hive Metastore server, which has a single consistent view of metadata across other applications in the ecosystem. This mode checks Hive user permissions against the POSIX permissions on the corresponding file directories in HDFS. In addition to the POSIX permissions model, HDFS also provides access control lists, described in ACLs on HDFS at http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#ACLs_Access_Control_Lists. Considering its implementation, the storage-based authorization mode only offers authorization at the level of Hive databases, tables, and partitions rather than at the column and row level. With its dependency on HDFS permissions, it lacks the flexibility to manage authorization through HQL statements.

To enable the storage-based authorization mode, we can set the following properties in the hive-site.xml file:

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization
  </description>
</property>
<property>
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  <description>The class name of the Hive client authorization manager.
  </description>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
  <description>Allows Hive queries to be run by the user who submits the
  query rather than the hive user.</description>
</property>
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
  <description>This turns on metastore-side security.</description>
</property>
<property>
  <name>hive.security.metastore.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  <description>The authorization manager class name to be used in the
  metastore for authorization.</description>
</property>

Note
Since Hive 0.14.0, storage-based authorization also authorizes read privileges on databases and tables by default through the hive.security.metastore.authorization.auth.reads property. For more information, please refer to https://issues.apache.org/jira/browse/HIVE-8221.
SQLstandard-basedmodeForfine-grainedaccesscontrolonacolumnandrowlevel,wecanuseSQLstandard-basedmodeavailablesinceHive0.13.0.ItissimilartotheSQLauthorizationbyusingtheGRANTandREVOKEstatementstocontrolaccessthroughtheHiveServer2configuration.However,toolssuchasHiveCLIandHadoop/HDFS/MapReducecommandsdonotaccessdatathroughHiveServer2,soSQLstandard-basedmodecannotauthorizetheiraccess.Therefore,itisrecommendedtousestorage-basedmodetogetherwithSQLstandard-basedmodeauthorizationtoauthorizeuserswhodonotaccessfromHiveServer2.
ToenableSQLstandard-basedmodeauthorization,wecansetthefollowingpropertiesinthehive-site.xmlfile:
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
<description>Allows Hive queries to be run by the user who submits the
query rather than the hive user. This needs to be turned off for SQL
standard-based mode.</description>
</property>
<property>
<name>hive.users.in.admin.role</name>
<value>dayongd,administrator</value>
<description>A comma-separated list of users assigned to the ADMIN role.
</description>
</property>
<property>
<name>hive.security.authorization.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
<property>
<name>hive.security.authenticator.manager</name>
<value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>""</value>
<description>""(quotationmarkssurroundingasingleemptyspace).
</description>
</property>
The users in the configured admin role must run the following command to make the admin role effective, and then restart HiveServer2:
jdbc:hive2://> GRANT admin TO USER dayongd;
The basic syntax to grant or revoke an authorization role or privilege is as follows:
GRANT <ROLENAME> TO <USERS> [WITH ADMIN OPTION];
REVOKE [ADMIN OPTION FOR] <ROLENAME> FROM <USERS>;
Here, the following parameters are used:
<ROLENAME>: This can be a comma-separated list of role names
<USERS>: This can be a user or a role
WITH ADMIN OPTION: This makes sure that the user gets privileges to grant the role to other users/roles
Another example to grant or revoke a privilege is as follows:
GRANT <PRIVILEGE> ON <OBJECT> TO <USERS>;
REVOKE <PRIVILEGE> ON <OBJECT> FROM <USERS>;
Here, the following parameters are used:
<PRIVILEGE>: This can be INSERT, SELECT, UPDATE, DELETE, or ALL
<USERS>: This can be a user or a role
<OBJECT>: This is a table or a view
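As a concrete illustration (the user name below is hypothetical, and the employee table is the sample table used throughout this book), granting and revoking SELECT on a table might look like this:

```sql
-- Grant SELECT on the employee table to user1
GRANT SELECT ON TABLE employee TO USER user1;

-- Confirm which privileges user1 holds on the table
SHOW GRANT USER user1 ON TABLE employee;

-- Take the privilege away again
REVOKE SELECT ON TABLE employee FROM USER user1;
```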
For more examples of HQL statements to manage SQL standard-based authorization, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization#SQLStandardBasedHiveAuthorization-Configuration.
Note
Sentry
Sentry is a highly modular system for providing centralized, fine-grained, role-based authorization to both data and metadata stored on an Apache Hadoop cluster. It can be integrated with Hive to deliver advanced authorization controls. For more information about Sentry, please refer to http://incubator.apache.org/projects/sentry.html.
Encryption
For sensitive and legally protected data, such as personal identity information (PII), it is required to store the data in an encrypted format in the filesystem. However, Hive does not natively support encryption and decryption yet (see https://issues.apache.org/jira/browse/HIVE-5207).
Alternatively, we can look for third-party tools to encrypt and decrypt data after exporting it from Hive, but this requires additional postprocessing. The new HDFS encryption (see https://issues.apache.org/jira/browse/HDFS-6134) offers transparent encryption and decryption of data on HDFS. It will satisfy our request if we want to encrypt the whole dataset in HDFS. However, it cannot be applied at the selected column and row level in a Hive table, where the encrypted PII is often only a part of the raw data. In this case, the best solution for now is to use Hive UDFs to plug encryption and decryption implementations into selected columns or partial data in the Hive tables.
Sample UDF implementations for encryption and decryption using the AES encryption algorithm are as follows:
AESEncrypt.java: The implementation is as follows:
package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

@Description(
  name = "aesencrypt",
  value = "_FUNC_(str) - Returns encrypted string based on AES key.",
  extended = "Example:\n" +
    " > SELECT aesencrypt(pii_info) FROM table_name;\n"
)
@UDFType(deterministic = true, stateful = false)
/*
 * A Hive encryption UDF
 */
public class AESEncrypt extends UDF {
  public String evaluate(String unencrypted) {
    String encrypted = "";
    if (unencrypted != null) {
      try {
        encrypted = CipherUtils.encrypt(unencrypted);
      } catch (Exception e) {}
    }
    return encrypted;
  }
}
AESDecrypt.java: This can be implemented as follows:
package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

@Description(
  name = "aesdecrypt",
  value = "_FUNC_(str) - Returns unencrypted string based on AES key.",
  extended = "Example:\n" +
    " > SELECT aesdecrypt(pii_info) FROM table_name;\n"
)
@UDFType(deterministic = true, stateful = false)
/*
 * A Hive decryption UDF
 */
public class AESDecrypt extends UDF {
  public String evaluate(String encrypted) {
    // Check for null before touching the input to avoid a NullPointerException
    String unencrypted = null;
    if (encrypted != null) {
      unencrypted = encrypted;
      try {
        unencrypted = CipherUtils.decrypt(encrypted);
      } catch (Exception e) {}
    }
    return unencrypted;
  }
}
CipherUtils.java: This can be implemented as follows:
package com.packtpub.hive.essentials.hiveudf;

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;

/*
 * The core encryption and decryption logic
 */
public class CipherUtils
{
  // This is a secret key in terms of ASCII
  private static byte[] key = {
    0x75, 0x69, 0x69, 0x73, 0x40, 0x73, 0x41, 0x53, 0x65, 0x65,
    0x72, 0x69, 0x74, 0x4b, 0x65, 0x75
  };

  public static String encrypt(String strToEncrypt)
  {
    try
    {
      // Prepare the algorithm
      Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
      final SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
      // Initialize the cipher for encryption
      cipher.init(Cipher.ENCRYPT_MODE, secretKey);
      // Base64.encodeBase64String gives an ASCII string
      final String encryptedString =
        Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
      return encryptedString.replaceAll("\r|\n", "");
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
    return null;
  }

  public static String decrypt(String strToDecrypt)
  {
    try
    {
      // Prepare the algorithm
      Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
      final SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
      // Initialize the cipher for decryption
      cipher.init(Cipher.DECRYPT_MODE, secretKey);
      final String decryptedString =
        new String(cipher.doFinal(Base64.decodeBase64(strToDecrypt)));
      return decryptedString;
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
    return null;
  }
}
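The round trip of the same AES/ECB/PKCS5Padding logic can be verified outside Hive with a minimal standalone sketch. The class below is an illustration, not part of the book's code: it uses the JDK's java.util.Base64 instead of commons-codec so it is self-contained, and the 16-byte demo key spells out the same ASCII bytes as the key array above.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;

// Minimal round-trip sketch of AES/ECB/PKCS5Padding, mirroring CipherUtils.
// The key is the ASCII rendering of the byte array shown above (16 bytes).
public class AesRoundTrip {
    private static final byte[] KEY = "uiis@sASeeritKeu".getBytes();

    public static String encrypt(String plain) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        // Base64-encode the raw cipher bytes so the result is printable text
        return Base64.getEncoder().encodeToString(cipher.doFinal(plain.getBytes()));
    }

    public static String decrypt(String enc) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        return new String(cipher.doFinal(Base64.getDecoder().decode(enc)));
    }

    public static void main(String[] args) throws Exception {
        String cipherText = encrypt("Will");
        // Decrypting the cipher text returns the original value
        System.out.println(decrypt(cipherText).equals("Will")); // prints true
    }
}
```

Note that ECB mode is used here only because the book's sample does so; for production use, a mode with an initialization vector (such as CBC or GCM) is generally preferred.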
Note
AES
Short for Advanced Encryption Standard, AES is a symmetric 128-bit block data encryption technique developed by Belgian cryptographers Joan Daemen and Vincent Rijmen. For more information, please refer to http://en.wikipedia.org/wiki/Advanced_Encryption_Standard.
To deploy the UDFs and verify them, do the following:
jdbc:hive2://> ADD JAR /home/dayongd/Downloads/
. . . . . . > hiveessentials-1.0-SNAPSHOT.jar;
No rows affected (0.002 seconds)

jdbc:hive2://> CREATE TEMPORARY FUNCTION aesdecrypt AS
. . . . . . > 'com.packtpub.hive.essentials.hiveudf.AESDecrypt';
No rows affected (0.02 seconds)

jdbc:hive2://> CREATE TEMPORARY FUNCTION aesencrypt AS
. . . . . . > 'com.packtpub.hive.essentials.hiveudf.AESEncrypt';
No rows affected (0.015 seconds)

jdbc:hive2://> SELECT aesencrypt('Will') AS encrypt_name
. . . . . . > FROM employee LIMIT 1;
+---------------------------+
|       encrypt_name        |
+---------------------------+
| YGvo54QIahpb+CVOwv9OkQ==  |
+---------------------------+
1 row selected (34.494 seconds)

jdbc:hive2://> SELECT aesdecrypt('YGvo54QIahpb+CVOwv9OkQ==')
. . . . . . > AS decrypt_name
. . . . . . > FROM employee LIMIT 1;
+---------------+
| decrypt_name  |
+---------------+
| Will          |
+---------------+
1 row selected (45.43 seconds)
Summary
In this chapter, we introduced three main areas of Hive security: authentication, authorization, and encryption. We covered authentication for the metastore server and HiveServer2. Then, we talked about the default, storage-based, and SQL standard-based authorization modes in HiveServer2. At the end of this chapter, we discussed the use of Hive UDFs for encryption and decryption. After going through this chapter, we should clearly understand the different areas that will help us address Hive security.
In the next chapter, we'll talk about using Hive with other tools.
Chapter 10. Working with Other Tools
As one of the earliest and most popular SQL-over-Hadoop tools, Hive has many use cases of working with other tools to offer an end-to-end data intelligence solution. In this chapter, we will discuss the way Hive works with other big data tools in the following areas:
The JDBC/ODBC connector
HBase
Hue
HCatalog
ZooKeeper
Oozie
The Hive roadmap
JDBC/ODBC connector
JDBC/ODBC is one of the most common ways for Hive to work with other tools. Hadoop vendors, such as Cloudera and Hortonworks, offer free Hive JDBC/ODBC drivers so that Hive can be connected through these drivers; they can be found at the following links:
For Cloudera, the link is http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive.html
For Hortonworks, the link is http://hortonworks.com/hdp/addons/
We can use these JDBC/ODBC connectors to connect Hive to tools such as the following:
A command-line utility, such as Beeline, mentioned in Chapter 2, Setting Up the Hive Environment
An integrated development environment, such as Oracle SQL Developer, mentioned in Chapter 2, Setting Up the Hive Environment
Data extraction, transformation, loading, and integration tools, such as Talend Open Studio
Business intelligence reporting tools, such as JasperReports and QlikView
Data analysis tools, such as Microsoft Excel 2013
Data visualization tools, such as Tableau
Since the setup of the connectors is very straightforward, please refer to the websites of the preceding tools for more detailed instructions to connect to Hive.
HBase
HBase (see http://hbase.apache.org/) is a high-performance NoSQL key/value store on Hadoop. Hive offers a storage handler mechanism to integrate with HBase using the HBaseStorageHandler class, which creates HBase tables managed by Hive. By integrating Hive with HBase, Hive users can leverage the real-time transaction performance of HBase to do real-time big data analysis. Currently, the integration feature is still in progress, especially in the areas of higher performance and snapshot support. There is another project called Phoenix (see http://phoenix.apache.org/), which provides basic SQL with higher-performance support over HBase.
An example of creating an HBase table in HQL is as follows:
CREATE TABLE hbase_table_sample(
id int,
value1 string,
value2 string,
map_value map<string, string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf1:val,cf2:val,cf3:")
TBLPROPERTIES ("hbase.table.name" = "table_name_in_hbase");
In this special CREATE TABLE statement, the HBaseStorageHandler class delegates interaction with the HBase table to HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat. The hbase.columns.mapping property is required to map each table column defined in the statement to the HBase table columns in order. For example, the ID, by order, maps to the HBase table's row key as :key. Sometimes, we may need to generate the proper row key columns using Hive UDFs if there is no existing column that can be used as a row key for the HBase table. The value1 column maps to the val column in the cf1 column family in the HBase table. The Hive MAP data type can be used to access an entire column family. Each row can have a different set of columns, where the column names correspond to the map keys and the column values correspond to the map values, such as the map_value columns. The hbase.table.name property, which is optional, specifies the table name known by HBase. If it is not provided, the Hive and HBase tables will have the same name, such as hbase_table_sample.
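Once defined, the mapped table can be written to and read through Hive like any other table. The following is a minimal sketch (the values are made up, and the employee table from earlier chapters is assumed to exist just to drive the row-producing SELECT):

```sql
-- Write one row through Hive; it lands in the mapped HBase table
INSERT INTO TABLE hbase_table_sample
SELECT 1, 'v1', 'v2', map('q1', 'mv1') FROM employee LIMIT 1;

-- Read it back; the MAP column exposes the whole cf3 column family,
-- so individual qualifiers can be addressed by map key
SELECT id, value1, map_value['q1'] FROM hbase_table_sample;
```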
Note
For more information about configurations and features in progress for Hive-HBase integration, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.
Hue
Hue (see http://gethue.com/) is short for Hadoop User Experience. It is a web interface for making the Hadoop ecosystem easier to use. For Hive users, Hue offers a unified web interface for easily accessing both HDFS and Hive in an interactive environment. Hue can be installed alone or with the Hadoop vendor packages. In addition, Hue adds more programming-friendly features to Hive, such as the following:
Highlights HQL keywords
Autocompletes HQL queries
Offers live progress and logs for Hive and MapReduce jobs
Submits several queries and checks progress later
Browses data in Hive tables through a web user interface
Navigates through the metadata
Registers UDFs and adds files/archives through a web user interface
Saves, exports, and shares the query result
Creates various charts from the query result
The following is a screenshot of the Hive editor interface in Hue:
Hue Hive editor user interface
HCatalog
HCatalog (see https://cwiki.apache.org/confluence/display/Hive/HCatalog) is a metadata management system for Hadoop data. It stores consistent schema information for Hadoop ecosystem tools, such as Pig, Hive, and MapReduce. By default, HCatalog supports data in the RCFile, CSV, JSON, SequenceFile, and ORC file formats, as well as a customized format if InputFormat, OutputFormat, and SerDe are implemented. By using HCatalog, users are able to directly create, edit, and expose (via its REST API) metadata, which becomes effective immediately in all tools sharing the same piece of metadata. At first, HCatalog was a separate Apache project from Hive and was part of the Apache Incubator, where most Apache projects first start. Eventually, HCatalog became a part of the Hive project in 2013, starting with Hive 0.11.0.
HCatalog is built on top of the Hive metastore and incorporates support for Hive DDL. It provides read and write interfaces, HCatLoader and HCatStorer, for Pig by implementing Pig's load and store interfaces, respectively. HCatalog also provides an interface for MapReduce programs by using HCatInputFormat and HCatOutputFormat, which are very similar to other customized formats, by implementing Hadoop's InputFormat and OutputFormat. HCatalog provides a REST API from a component called WebHCat so that HTTP requests can be made to access the metadata of Hadoop MapReduce/Yarn, Pig, Hive, and HCatalog DDL from other applications. There is no Hive-specific interface since HCatalog uses Hive's metastore. Therefore, HCatalog can define metadata for Hive directly through its CLI. The HCatalog CLI supports the HQL SHOW/DESCRIBE statements and the majority of Hive DDL, except the following statements, which require running MapReduce jobs:
CREATE TABLE … AS SELECT
ALTER INDEX … REBUILD
ALTER TABLE … CONCATENATE
ALTER TABLE ARCHIVE/UNARCHIVE PARTITION
ANALYZE TABLE … COMPUTE STATISTICS
IMPORT/EXPORT
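Because the HCatalog CLI accepts the same DDL as Hive, a schema defined once is immediately visible to Hive, Pig (via HCatLoader/HCatStorer), and MapReduce. A sketch with a hypothetical table, which could typically be run through the CLI as hcat -e "<ddl>", follows:

```sql
-- Define the schema once through HCatalog; Hive, Pig, and MapReduce
-- then all share the same table definition
CREATE TABLE web_logs (
  ip STRING,
  ts TIMESTAMP,
  url STRING
)
PARTITIONED BY (log_date STRING)
STORED AS ORC;

-- The HCatalog CLI also supports SHOW/DESCRIBE
DESCRIBE web_logs;
```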
ZooKeeper
ZooKeeper (see http://zookeeper.apache.org/) is a centralized service for configuration management and the synchronization of various aspects of naming and coordination. It manages a naming registry and effectively implements a system for managing the various statically and dynamically named objects in a hierarchical system. It also enables coordination of and control over shared resources, such as files and data, which are manipulated by multiple concurrent processes.
Unlike RDBMS, Hive does not natively support concurrent access and locking mechanisms. Hive has relied on ZooKeeper for locking shared resources since Hive 0.7.0. There are two types of locks provided by Hive through ZooKeeper, and they are as follows:
Shared lock: This is acquired when a table/partition is read. Concurrent shared locks are allowed in Hive.
Exclusive lock: This is acquired for all other operations that modify the table. For partitioned tables, only a shared lock is acquired if the change is only applicable to the newly created partitions. An exclusive lock is acquired on the table if the change is applicable to all partitions. In addition, an exclusive lock on the table globally affects all partitions.
Any HQL statement must acquire the proper locks before being allowed to perform the corresponding lock-permitted operations.
To enable locking in Hive, we need to make sure ZooKeeper is installed and configured. Then, configure the following properties in Hive's hive-site.xml file:
<property>
<name>hive.support.concurrency</name>
<description>Enable Hive's Table Lock Manager Service</description>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<description>Comma-separated ZooKeeper quorum used by Hive's Table Lock
Manager.</description>
<value>localhost.localdomain</value>
</property>
We can also set the following property to use the new lock manager for transaction support since Hive 0.13.0:
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
Note
Once configured, we can further set locking properties, specified and detailed at https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Locking.
Locks are either implicitly acquired/released from HQL or explicitly acquired/released using the LOCK and UNLOCK statements, as follows:
-- Lock the table and specify the lock type
jdbc:hive2://> LOCK TABLE employee SHARED;
No rows affected (1.328 seconds)

-- Show the lock information on the specific table
jdbc:hive2://> SHOW LOCKS employee EXTENDED;
+------------------------------------------------------------------------+--------+
|                                tab_name                                |  mode  |
+------------------------------------------------------------------------+--------+
| default@employee                                                       | SHARED |
| LOCK_QUERYID: hive_20150105170303_792598b1-0ac8-4aad-aa4e-c4cdb0de6697 |        |
| LOCK_TIME: 1420495466554                                               |        |
| LOCK_MODE: EXPLICIT                                                    |        |
| LOCK_QUERYSTRING: LOCK TABLE employee shared                           |        |
+------------------------------------------------------------------------+--------+
5 rows selected (0.576 seconds)

-- Release the lock on the table
jdbc:hive2://> UNLOCK TABLE employee;
No rows affected (0.209 seconds)

-- Show all locks in the database
jdbc:hive2://> SHOW LOCKS;
+-----------+-------+
| tab_name  | mode  |
+-----------+-------+
+-----------+-------+
No rows selected (0.529 seconds)

jdbc:hive2://> LOCK TABLE employee EXCLUSIVE;
No rows affected (0.185 seconds)

jdbc:hive2://> SHOW LOCKS employee EXTENDED;
+------------------------------------------------------------------------+-----------+
|                                tab_name                                |   mode    |
+------------------------------------------------------------------------+-----------+
| default@employee                                                       | EXCLUSIVE |
| LOCK_QUERYID: hive_20150105170808_bbc6db18-e44a-49a1-bdda-3dc30b5c8cee |           |
| LOCK_TIME: 1420495807855                                               |           |
| LOCK_MODE: EXPLICIT                                                    |           |
| LOCK_QUERYSTRING: LOCK TABLE employee exclusive                        |           |
+------------------------------------------------------------------------+-----------+
5 rows selected (0.578 seconds)

jdbc:hive2://> SELECT * FROM employee;
When the table holds an exclusive lock, the preceding SELECT statement will wait for the lock and show nothing as a result set unless we unlock the table in the other session. From the Hive log, we can find the following information, which specifies that the SELECT statement is waiting to get the read lock:
15/01/05 17:13:39 INFO ql.Driver: <PERFLOG method=acquireReadWriteLocks>
15/01/05 17:13:39 ERROR ZooKeeperHiveLockManager: conflicting lock present for default@employee mode SHARED
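If waiting indefinitely on a conflicting lock is undesirable, Hive exposes retry-related properties that bound how long a statement keeps trying to acquire a lock. The values below are only illustrative; check the locking configuration page referenced in the following note for the current property list and defaults:

```xml
<property>
<name>hive.lock.numretries</name>
<value>10</value>
<description>The number of times to retry acquiring a lock.</description>
</property>
<property>
<name>hive.lock.sleep.between.retries</name>
<value>60</value>
<description>Seconds to sleep between lock acquisition retries.</description>
</property>
```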
Note
For more information about using ZooKeeper for Hive locks, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Locking.
Oozie
Oozie (see http://oozie.apache.org/) is an open source workflow coordination and schedule service to manage data processing jobs. Oozie workflow jobs are defined as a series of nodes in a Directed Acyclical Graph (DAG). Acyclical here means that there are no loops in the graph, and all nodes in the graph flow in one direction without going back. Oozie workflows contain either control flow nodes or action nodes:
Control flow node: This either defines the start, end, and failed node in a workflow, or controls the workflow execution path, such as the decision, fork, and join nodes.
Action node: This defines the core data processing action job, such as MapReduce, Hadoop filesystem, Hive, Pig, Java, Shell, e-mail, and Oozie subworkflows. Additional types of actions are also supported by developing extensions.
Oozie is a scalable, reliable, and extensible system. It can be parameterized for workflow submission and scheduled to run automatically. Therefore, Oozie is very suitable for lightweight data integration or maintenance jobs.
Hue offers very friendly and powerful support for Oozie through the Oozie editor. Creating and submitting an Oozie workflow of Hive actions from Hue is as straightforward as the following steps:
1. Log in to Hue and select Workflows | Editors | Workflows from the top menu bar to open Workflow Manager.
2. Click on the Create button to create a workflow.
3. Give a proper workflow name and save the workflow.
4. Once the workflow is saved, the Oozie editor window appears for further settings.
5. Drag a Hive action to the middle of the start and end nodes.
6. In the Edit Node: menu shown, the following settings are present. Provide proper settings as follows:
Name: Give a proper action name.
Description: This is where to describe the job. This is optional.
Advanced: This is for SLA monitoring. This is optional.
Script name: Choose the HQL scripts from HDFS for the Hive action.
Prepare: Define actions, such as deleting files or creating folders, before running the script. This is optional.
Parameters: This defines the parameters to be taken when submitting the job (such as ${date}). This is optional.
Job properties: This is where to set Hadoop/Hive properties. This is optional.
Files: This is where to select the files needed for the scripts. This is optional.
Archives: This is where to select the archive files, such as UDF JARs. This is optional.
Job XML: Choose a copy of the hive-site.xml file of the Hive cluster from HDFS so that Oozie can connect to the Hive metastore.
7. Click on Done in the Edit Node: menu, and then click on Save in Workflow Editor.
8. Click on Submit to submit the workflow. Then, the Hive action is triggered by the Oozie workflow successfully.
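Under the hood, the workflow Hue saves is simply an XML definition. A minimal hand-written equivalent with a single Hive action might look like the following sketch; the schema versions are common ones, and the paths, parameters, and node names are placeholders:

```xml
<workflow-app name="hive-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="hive-node"/>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- Copy of hive-site.xml on HDFS so Oozie can reach the metastore -->
      <job-xml>/user/hue/hive-site.xml</job-xml>
      <script>/user/hue/scripts/daily_report.hql</script>
      <param>date=${date}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```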
Hive roadmap
As this is the end of this chapter as well as of this book, the highlights of each Hive release milestone and the features expected in the future are summarized as follows, along with best wishes to the Hive communities for growing bigger and better in the near future:
December 2011 – Hive 0.8.0
Added Bitmap indexes
Added the TIMESTAMP data type
Added the Hive Plugin Developer Kit to make plugin building and testing easier
Improved the JDBC Driver and bug fixes
April 2012 – Hive 0.9.0
Added the CREATE OR REPLACE VIEW statement
Added NOT IN and NOT LIKE support
Added the BETWEEN and NULL-safe equality operators
Added the printf(), sort_array(), and concat_ws() functions
Added a filter push-down from Hive into HBase for the key column
Combined multiple UNION ALL statements in one MapReduce job
Combined multiple GROUP BY statements on the same data with the same keys in one MapReduce job
January 2013 – Hive 0.10.0
Added the CUBE and ROLLUP statements
Added better support for YARN
Added more information to the EXPLAIN statement
Added the SHOW CREATE TABLE statement
Added built-in support for reading/writing Avro data
Added improvements for skewed joins
Improved the speed of simple queries by not running MapReduce jobs
May 2013 – Hive 0.11.0 as Stinger Phase 1
Added ORC for better performance
Added analytic and windowing functions
Added HCatalog as part of Hive
Added GROUP BY column positions
Improved data types and added the DECIMAL data type
Improved joins for broadcast and SMB joins
Implemented HiveServer2
October 2013 – Hive 0.12.0 as Stinger Phase 2
Added VARCHAR and DATE support
Added parallel ORDER BY to Hive
Added more improvements for ORC, such as predicate push-down
Added a correlation optimizer
Added support for GROUP BY on the STRUCT type
Added support for the outer lateral view
Pushed LIMIT down to mappers
April 2014 – Hive 0.13.0 as Stinger Phase 3 Final
Added the DECIMAL and CHAR data types
Added support for running jobs on Tez
Added a vectorized query engine
Added support for subqueries with IN, NOT IN, EXISTS, and NOT EXISTS
Added support for permanent functions
Added support for common table expressions
Added SQL standard-based authorization
November 2014 – Hive 0.14.0 as Stinger.next Phase 1
Added transactions with ACID semantics
Added a Cost-Based Optimizer (CBO)
Added the CREATE TEMPORARY TABLE statement
Added support for STORED AS AVRO in the CREATE TABLE statement
Added the skipTrash configuration for the DROP TABLE statement
Added AccumuloStorageHandler
Used Tez auto-parallelism in Hive
February 2015 – Hive 1.0.0
Moved to a 1.x.y release naming structure
Made HiveMetaStoreClient a public API
Removed HiveServer1
Switched to Tez 0.5.2
Future
Offer subsecond queries with Live Long And Process (LLAP)
Offer Hive over Spark
Support SQL:2011 analytics
Support cross-geo queries
Offer materialized views
Offer workload management via YARN and LLAP integration
Make Hive a unified data query tool
Summary
In this final chapter, we introduced big data tools that can work with Hive, including the JDBC/ODBC connectors, HBase, Hue, HCatalog, ZooKeeper, and Oozie. Then, we reviewed the key releases of Hive from 0.8.0 to 1.0.0, as well as the exciting features expected in the future. After going through this chapter, we should understand how to use other big data tools with Hive to provide end-to-end data intelligence solutions.
Index
A
Abstract syntax tree (AST)
  about / The EXPLAIN statement
ACLs on HDFS
  URL / Storage-based mode
Advanced Encryption Standard (AES)
  URL / Encryption
aggregate functions / Operators and functions
aggregation
  data aggregation / Basic aggregation – GROUP BY
  without GROUP BY columns / Basic aggregation – GROUP BY
  with GROUP BY columns / Basic aggregation – GROUP BY
  advanced / Advanced aggregation – GROUPING SETS, Advanced aggregation – ROLLUP and CUBE
  ROLLUP statement / Advanced aggregation – ROLLUP and CUBE
  CUBE statement / Advanced aggregation – ROLLUP and CUBE
  condition, HAVING statement / Aggregation condition – HAVING
Amazon EMR
  URL / Starting Hive in the cloud
analytic functions
  about / Analytic functions
  Function (arg1, …, argn) / Analytic functions
  Standard aggregations / Analytic functions
  RANK / Analytic functions
  DENSE_RANK / Analytic functions
  ROW_NUMBER / Analytic functions
  CUME_DIST / Analytic functions
  PERCENT_RANK / Analytic functions
  NTILE / Analytic functions
  LEAD function / Analytic functions
  LAG function / Analytic functions
  FIRST_VALUE / Analytic functions
  LAST_VALUE / Analytic functions
  window expressions / Analytic functions
ANALYZE statement
  about / The ANALYZE statement
ANTLR
  URL / The EXPLAIN statement
Apache
  used, for installing Hive / Installing Hive from Apache
Apache Hive
  Wiki, URL / Using the Hive command line and Beeline
Apache Hive Wiki
  URL / HBase
Apache JIRA Hive-365
  URL / Understanding Hive data types
Atomicity, Consistency, Isolation, and Durability (ACID)
  about / Transactions
authentication
  about / Authentication
  Metastore server authentication / Metastore server authentication
  HiveServer2 authentication / HiveServer2 authentication
authorization
  about / Authorization
  legacy mode / Legacy mode
  storage-based mode / Storage-based mode
  SQL standard-based mode / SQL standard-based mode
Avro
  URL / SerDe
AvroSerDe / SerDe
Azure HDInsight Service
  URL / Starting Hive in the cloud
B
batch processing
  about / Batch, real-time, and stream processing
Beeline
  using / Using the Hive command line and Beeline
  URL / Using the Hive command line and Beeline
  command-line syntax / Using the Hive command line and Beeline
big data
  about / Introducing big data
  volume / Introducing big data
  velocity / Introducing big data
  variety / Introducing big data
  veracity / Introducing big data
  variability / Introducing big data
  volatility / Introducing big data
  visualization / Introducing big data
  value / Introducing big data
block sampling / Sampling
bucket map join / Bucket map join
buckets
  about / Hive buckets
  number / Hive buckets
bucket tables
  about / Bucket tables
bucket table sampling / Sampling
C
cloud
  Hive, starting / Starting Hive in the cloud
Cloudera
  URL / Starting Hive in the cloud
  about / JDBC/ODBC connector
Cloudera Distributed Hadoop (CDH)
  URL / Installing Hive from vendor packages
CLUSTER BY / ORDER and SORT
collection functions / Operators and functions
collection item delimiter / Understanding Hive data types
ColumnarSerDe / SerDe
CombineFileInputFormat / Storage optimization
common join, join optimization / Common join
Common Table Expression (CTE) / Hive internal and external tables
compression / Compression
conditional functions / Operators and functions
Cost-Based Optimizer (CBO)
  about / The ANALYZE statement
Cost Base Optimizer (CBO) / Hive roadmap
CREATE TABLE / Hive internal and external tables
Create the table as select (CTAS) / Hive internal and external tables
CROSS JOIN statement / The OUTER JOIN and CROSS JOIN statements
CUBE statement
  about / Advanced aggregation – ROLLUP and CUBE
D
data aggregation
  about / Basic aggregation – GROUP BY
database, Hive
  about / Hive database
data exchange
  LOAD keyword / Data exchange – LOAD
  INSERT keyword / Data exchange – INSERT
  EXPORT statement / Data exchange – EXPORT and IMPORT
  IMPORT statement / Data exchange – EXPORT and IMPORT
data file optimization
  about / Data file optimization
  file format / File format
  compression / Compression
  storage optimization / Storage optimization
data type conversions
  about / Data type conversions
  primitive type conversion / Data type conversions
  explicit type conversion / Data type conversions
data type functions tips, complex / Operators and functions
data types, Hive
  about / Understanding Hive data types
  TINYINT / Understanding Hive data types
  SMALLINT / Understanding Hive data types
  INT / Understanding Hive data types
  BIGINT / Understanding Hive data types
  FLOAT / Understanding Hive data types
  DOUBLE / Understanding Hive data types
  DECIMAL / Understanding Hive data types
  BINARY / Understanding Hive data types
  BOOLEAN / Understanding Hive data types
  STRING / Understanding Hive data types
  CHAR / Understanding Hive data types
  VARCHAR / Understanding Hive data types
  DATE / Understanding Hive data types
  TIMESTAMP / Understanding Hive data types
date functions / Operators and functions
date function tips / Operators and functions
delimiters
  row delimiter / Understanding Hive data types
  collection item delimiter / Understanding Hive data types
  map key delimiter / Understanding Hive data types
deployment / Development and deployment
Derby
  URL / Installing Hive from Apache
design optimization
  about / Design optimization
  partition tables / Partition tables
  bucket tables / Bucket tables
  index / Index
development / Development and deployment
Directed Acyclical Graph (DAG) / Oozie
directed acyclic graphs (DAGs) / Index
DISTRIBUTE BY / ORDER and SORT
E
encryption
  about / Encryption
EXPLAIN statement
  about / The EXPLAIN statement
  EXTENDED keyword / The EXPLAIN statement
  DEPENDENCY keyword / The EXPLAIN statement
  AUTHORIZATION keyword / The EXPLAIN statement
explicit type conversion / Data type conversions
EXPORT statement / Data exchange – EXPORT and IMPORT
external tables
  about / Hive internal and external tables
F
file format, data file optimization
  about / File format
  TEXTFILE / File format
  SEQUENCEFILE / File format
  RCFILE / File format
  Optimized Row Columnar (ORC) / File format
  PARQUET / File format
Flume / Overview of the Hadoop ecosystem
functions
  about / Operators and functions
  mathematical functions / Operators and functions
  collection functions / Operators and functions
  type conversion functions / Operators and functions
  date functions / Operators and functions
  conditional functions / Operators and functions
  string functions / Operators and functions
  aggregate functions / Operators and functions
  table-generating functions / Operators and functions
  customized / Operators and functions
  complex data type functions tips / Operators and functions
  date function tips / Operators and functions
  CASE, for data types / Operators and functions
  parser and search tips / Operators and functions
  virtual columns / Operators and functions
G
GenericUDAF
  URL / The UDAF code template
GROUPING SETS keyword
  about / Advanced aggregation – GROUPING SETS
H
Hadoop
  versus relational database / Relational and NoSQL database versus Hadoop
  versus NoSQL database / Relational and NoSQL database versus Hadoop
Hadoop Archive and HAR / Storage optimization
Hadoop Archive File (HAR) / File format
Hadoop ecosystem
  about / Overview of the Hadoop ecosystem
HAVING statement
  about / Aggregation condition – HAVING
HBase
  about / HBase
  URL / HBase
  table, creating in HQL / HBase
HBaseSerDe / SerDe
HCatalog
  about / HCatalog
  URL / HCatalog
HDFS
  about / Batch, real-time, and stream processing, Overview of the Hadoop ecosystem
HDFS federation / Storage optimization
Hive
  about / Hive overview
  installing, from Apache / Installing Hive from Apache
  URL / Installing Hive from Apache
  installing, from vendor packages / Installing Hive from vendor packages
  starting, in cloud / Starting Hive in the cloud
  data types / Understanding Hive data types
  complex types / Understanding Hive data types
  types / Understanding Hive data types
  database / Hive database
  internal tables / Hive internal and external tables
  external tables / Hive internal and external tables
  partitions / Hive partitions
  buckets / Hive buckets
  views / Hive views
  performance utilities / Performance utilities
Hive, complex types
  ARRAY / Understanding Hive data types
  MAP / Understanding Hive data types
  STRUCT / Understanding Hive data types
  NAMED STRUCT / Understanding Hive data types
  UNION / Understanding Hive data types
Hive-integrated development environment (IDE)
  about / The Hive-integrated development environment
hive.map.aggr property / Basic aggregation – GROUP BY
Hive CLI
  command-line syntax / Using the Hive command line and Beeline
  URL / Using the Hive command line and Beeline
Hive command line
  using / Using the Hive command line and Beeline
Hive Data Definition Language (DDL)
  about / Hive Data Definition Language
Hive join optimization
  URL / Skew join
Hive roadmap
  about / Hive roadmap
HiveServer2
  URL / Using the Hive command line and Beeline
HiveServer2 authentication
  none authentication / HiveServer2 authentication
  Kerberos authentication / HiveServer2 authentication
  LDAP authentication / HiveServer2 authentication
  pluggable custom authentication / HiveServer2 authentication
  Pluggable Authentication Modules (PAM) authentication / HiveServer2 authentication
Hive Wiki
  URL / Operators and functions
Hortonworks
  URL / JDBC/ODBC connector
HQL
  about / Hive overview
Hue
  URL / The Hive-integrated development environment, Hue
  about / Hue
I
Impala
  URL / A short history
IMPORT statement / Data exchange – EXPORT and IMPORT
index
  about / Index
INNER JOIN statement / The INNER JOIN statement
INSERT keyword / Data exchange – INSERT
internal tables
  about / Hive internal and external tables
J
Java IDE
  URL / Development and deployment
Java Virtual Machine (JVM) / Batch, real-time, and stream processing
javax.script API
  URL / User-defined functions
JDBC/ODBC connector
  about / JDBC/ODBC connector
job and query optimization
  about / Job and query optimization
  local mode / Local mode
  JVM reuse / JVM reuse
  parallel execution / Parallel execution
join optimization
  about / Join optimization
  common join / Common join
  map join / Map join
  bucket map join / Bucket map join
  Sort merge bucket (SMB) join / Sort merge bucket (SMB) join
  Sort merge bucket map (SMBM) join / Sort merge bucket map (SMBM) join
  skew join / Skew join
JSONSerDe
  URL / SerDe
  about / SerDe
JVM reuse, job and query optimization / JVM reuse
K
Kerberos
  about / Authentication
Kerberos authentication / HiveServer2 authentication
Key Distribution Center (KDC) / Authentication
L
LazySimpleSerDe / SerDe
LDAP authentication / HiveServer2 authentication
legacy mode, authorization
  about / Legacy mode
Live Long And Process (LLAP) / Hive roadmap
LOAD keyword / Data exchange – LOAD
local mode, job and query optimization / Local mode
M
map join, join optimization / Map join
MAPJOIN statement / Special JOIN – MAPJOIN
map key delimiter / Understanding Hive data types
mathematical functions / Operators and functions
Maven
  URL / Development and deployment
metastore / Hive overview
Metastore server authentication
  about / Metastore server authentication
MIT Kerberos
  URL / Authentication
MySQL
  URL / Installing Hive from Apache
N
none authentication / HiveServer2 authentication
NoSQL database
  versus Hadoop / Relational and NoSQL database versus Hadoop
O
Oozie
  about / Oozie
  URL / Oozie
  control flow node / Oozie
  action node / Oozie
OpenCSVSerDe / SerDe
operators
  about / Operators and functions
Optimized Row Columnar (ORC) / Index, File format
Optimized Row Columnar (ORC) file
  about / Transactions
ORDER BY (ASC|DESC) keyword / ORDER and SORT
ORDER keyword / ORDER and SORT
OUTER JOIN statement / The OUTER JOIN and CROSS JOIN statements
OutOfMemory (OOM) exceptions / The INNER JOIN statement
P
parallel execution, job and query optimization / Parallel execution
ParquetHiveSerDe / SerDe
parser and search tips / Operators and functions
PARTITION BY statement / Analytic functions
partitions
  about / Hive partitions
partition tables
  by date and time / Partition tables
  by locations / Partition tables
  by business logics / Partition tables
personal identity information (PII)
  about / Encryption
Phoenix
  URL / HBase
Pluggable Authentication Modules (PAM) authentication / HiveServer2 authentication
pluggable custom authentication / HiveServer2 authentication
PostgreSQL
  URL / Installing Hive from Apache
Presto
  URL / A short history
primitive type conversion / Data type conversions
Processing Elements (PE) / Batch, real-time, and stream processing
R
random sampling
  URL / Sampling
real-time processing
  about / Batch, real-time, and stream processing
Record Columnar File (RCFILE) / File format
RegexSerDe / SerDe
relational database
  versus Hadoop / Relational and NoSQL database versus Hadoop
ROLLUP statement
  about / Advanced aggregation – ROLLUP and CUBE
row delimiter / Understanding Hive data types
S
sampling
  about / Sampling
  random sampling / Sampling
  bucket table sampling / Sampling
  block sampling / Sampling
SELECT * statement / The SELECT statement
SELECT statement / The SELECT statement
Sentry
  URL / SQL standard-based mode
SequenceFile format / Storage optimization
SerDe
  about / SerDe
  data, reading / SerDe
  data, writing / SerDe
  LazySimpleSerDe / SerDe
  ColumnarSerDe / SerDe
  RegexSerDe / SerDe
  HBaseSerDe / SerDe
  AvroSerDe / SerDe
  ParquetHiveSerDe / SerDe
  OpenCSVSerDe / SerDe
  JSONSerDe / SerDe
SHOW TRANSACTIONS command / Transactions
Simple Authentication and Security Layer (SASL) framework / Metastore server authentication
skew join / Skew join
SORT BY (ASC|DESC) keyword / ORDER and SORT
SORT keyword / ORDER and SORT
sort merge bucket (SMB) join / Sort merge bucket (SMB) join
sort merge bucket map (SMBM) join / Sort merge bucket map (SMBM) join
Spark / Overview of the Hadoop ecosystem
SQLLine
  URL / Using the Hive command line and Beeline
SQL standard-based mode, authorization
  about / SQL standard-based mode
Sqoop / Overview of the Hadoop ecosystem
stage dependencies
  about / The EXPLAIN statement
stage plans
  about / The EXPLAIN statement
storage-based mode, authorization
  about / Storage-based mode
storage optimization / Storage optimization
Storm
  URL / A short history, Batch, real-time, and stream processing
streaming
  about / Streaming
stream processing
  about / Batch, real-time, and stream processing
string functions / Operators and functions
Structured Query Language (SQL)
  about / A short history
T
table-generating functions / Operators and functions
Tez / Overview of the Hadoop ecosystem
  about / Index
  URL / Index
transactions
  about / Transactions
type conversion functions / Operators and functions
U
UDAF
  code, template / The UDAF code template
UDAFs
  about / User-defined functions
UDF
  code, template / The UDF code template
UDFs
  about / User-defined functions
UDTF
  code, template / The UDTF code template
UDTFs
  about / User-defined functions
Uniform Resource Identifier (URI) / Data exchange – LOAD
UNION ALL statement / Set operation – UNION ALL
V
value / Introducing big data
variability / Introducing big data
variety / Introducing big data
Vectorization optimization
  about / Index
  URL / Index
velocity / Introducing big data
vendor packages
  used, for installing Hive / Installing Hive from vendor packages
veracity / Introducing big data
views
  about / Hive views
  altering / Hive views
  redefining / Hive views
  dropping / Hive views
virtual columns / Operators and functions
visualization / Introducing big data
volatility / Introducing big data
volume / Introducing big data
W
WHERE clauses
  subqueries, restrictions / The SELECT statement
window expressions
  BETWEEN … AND clause / Analytic functions
  N PRECEDING or FOLLOWING / Analytic functions
  UNBOUNDED PRECEDING / Analytic functions
  UNBOUNDED FOLLOWING / Analytic functions
  UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING / Analytic functions
  CURRENT ROW / Analytic functions
  URL / Analytic functions
Z
ZooKeeper
  about / ZooKeeper
  URL / ZooKeeper
  shared lock / ZooKeeper
  exclusive lock / ZooKeeper
  for Hive locks, URL / ZooKeeper