Apache Hive Essentials
Table of Contents
Apache Hive Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Overview of Big Data and Hive
A short history
Introducing big data
Relational and NoSQL database versus Hadoop
Batch, real-time, and stream processing
Overview of the Hadoop ecosystem
Hive overview
Summary
2. Setting Up the Hive Environment
Installing Hive from Apache
Installing Hive from vendor packages
Starting Hive in the cloud
Using the Hive command line and Beeline
The Hive-integrated development environment
Summary
3. Data Definition and Description
Understanding Hive data types
Data type conversions
Hive Data Definition Language
Hive database
Hive internal and external tables
Hive partitions
Hive buckets
Hive views
Summary
4. Data Selection and Scope
The SELECT statement
The INNER JOIN statement
The OUTER JOIN and CROSS JOIN statements
Special JOIN – MAPJOIN
Set operation – UNION ALL
Summary
5. Data Manipulation
Data exchange – LOAD
Data exchange – INSERT
Data exchange – EXPORT and IMPORT
ORDER and SORT
Operators and functions
Transactions
Summary
6. Data Aggregation and Sampling
Basic aggregation – GROUP BY
Advanced aggregation – GROUPING SETS
Advanced aggregation – ROLLUP and CUBE
Aggregation condition – HAVING
Analytic functions
Sampling
Summary
7. Performance Considerations
Performance utilities
The EXPLAIN statement
The ANALYZE statement
Design optimization
Partition tables
Bucket tables
Index
Data file optimization
File format
Compression
Storage optimization
Job and query optimization
Local mode
JVM reuse
Parallel execution
Join optimization
Common join
Map join
Bucket map join
Sort merge bucket (SMB) join
Sort merge bucket map (SMBM) join
Skew join
Summary
8. Extensibility Considerations
User-defined functions
The UDF code template
The UDAF code template
The UDTF code template
Development and deployment
Streaming
SerDe
Summary
9. Security Considerations
Authentication
Metastore server authentication
HiveServer2 authentication
Authorization
Legacy mode
Storage-based mode
SQL standard-based mode
Encryption
Summary
10. Working with Other Tools
JDBC/ODBC connector
HBase
Hue
HCatalog
ZooKeeper
Oozie
Hive roadmap
Summary
Index
Apache Hive Essentials

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2015
Production reference: 1210215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78355-857-5
www.packtpub.com
Credits

Author
Dayong Du

Reviewers
Puneetha B M
Hamzeh Khazaei
Nitin Pradeep Kumar
Balaswamy Vaddeman

Commissioning Editor
Ashwin Nair

Acquisition Editor
Shaon Basu

Content Development Editor
Merwyn D'souza

Technical Editor
Taabish Khan

Copy Editors
Sameen Siddiqui
Laxmi Subramanian

Project Coordinator
Neha Bhatnagar

Proofreaders
Paul Hindle
Jonathan Todd

Indexer
Monica Ajmera Mehta

Production Coordinator
Aparna Bhagat

Cover Work
Aparna Bhagat
About the Author

Dayong Du is a big data practitioner, leader, and developer with expertise in technology consulting, designing, and implementing enterprise big data solutions. With more than 10 years of experience in enterprise data warehouse, business intelligence, and big data and analytics, he has provided his data intelligence expertise in various industries, such as media, travel, telecommunications, and so on. He is currently working with QuickPlay Media in Toronto, Canada, to build enterprise big data intelligence reporting for online media services and content providers. He has a master's degree in computer science from Dalhousie University, and he holds the Cloudera Certified Developer for Apache Hadoop certification.

I would like to sincerely thank my wife, Joice, and daughter, Elaine, for their sacrifices and encouragement during this journey. Also, I would like to thank my parents for their support during the time of writing this book.

I would also like to thank everyone at Packt Publishing and the technical reviewers for their valuable help, guidance, and feedback on my book.
About the Reviewers

Puneetha B M is a software engineer, data enthusiast, and technical blogger. Her research interests include big data, cloud computing, machine learning, and NoSQL databases. She is also a professional software engineer with more than 2 years of working experience. She holds a master's degree in computer applications from P.E.S. Institute of Technology. Other than programming, she enjoys painting and listening to music. You can learn more from her blog (http://blog.puneethabm.in/) and LinkedIn profile (https://www.linkedin.com/in/puneethabm).

I owe a great deal to Prof. Dr. Ram Rustagi for being a role model in my life and for his zealous inspiration. I would like to thank my brother, Nischith B. M., for supporting me in everything I do. I would also like to thank Packt Publishing and its staff for providing the opportunity to contribute to this book.

Hamzeh Khazaei is a postdoctoral research scientist at IBM Canada Research and Development Centre. He received his PhD degree in computer science from University of Manitoba, Winnipeg, Manitoba, Canada (2009–2012). Earlier, he received both his BSc and MSc degrees in computer science from Amirkabir University of Technology, Tehran, Iran (2000–2008). He is also a sessional instructor in the Computer Science department at Ryerson University (http://scs.ryerson.ca/~hkhazaei). He teaches software engineering to fourth year undergraduate students. His research area includes big data analytics, cloud computing infrastructure, analytics as a service, and modeling of computing systems.

I would like to thank my dear wife for her perpetual support in all my endeavors.

Nitin Pradeep Kumar is a passionate developer with extensive experience and oodles of interest in emerging technologies such as the cloud and mobile. He is currently a cloud quality engineer at Appcelerator, a leading Silicon Valley-based start-up that provides an MBaaS platform purpose-built for mobile and cloud development. Before this stint, he studied at the National University of Singapore toward a master's degree in knowledge engineering, which involves building intelligent systems using cutting-edge artificial intelligence and data-mining techniques. He enjoys the start-up environment and has worked with technologies such as Hadoop, Hive, and data warehousing. He lives in Singapore and spends his spare cycles playing retro PC games on his mobile and learning Muay Thai.

I would like to thank my family, friends, and my wonderful brother, Nivin, for supporting me in all my endeavors.

Balaswamy Vaddeman is a Hadoop hackathon winner for Hyderabad in 2013. He is one of the top contributors on the Hive tag at http://www.stackoverflow.com. He is a big data professional with 3 years of experience. He is well known for training people on big data/Hadoop. So far, he has delivered six big data projects. He is a Java/J2EE expert with 8 years of IT experience and 5 years of RDBMS experience. He is an automation expert on Unix-based systems using Shell scripting. He has experience in setting up teams and bringing them up to speed on big data projects. He is an active participant in Hadoop/big data forums.

I would like to thank my wife, Radha, my son, Pandu, and my daughter, Bubly, for their cooperation in completing this book.
Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

I dedicate this book to my daughter
Preface

With an increasing interest in big data analysis, Hive over Hadoop has become a cutting-edge data solution for storing, computing, and analyzing big data. Its SQL-like syntax makes Hive easy to learn and widely accepted as a standard for interactive SQL queries over big data. The variety of features available within Hive provides the capability of doing complex big data analysis without advanced coding skills. The maturity of Hive lets it gradually merge and share its valuable architecture and functionalities across different computing frameworks beyond Hadoop.

Apache Hive Essentials prepares your journey to big data by covering the background and concepts of the big data domain, along with the process of setting up and getting familiar with your Hive working environment, in the first two chapters. In the next four chapters, the book guides you through discovering and transforming the value behind big data using examples of the Hive query language. In the last four chapters, the book highlights well-selected and advanced topics, such as performance, security, and extensions, as exciting adventures for this worthwhile big data journey.
What this book covers

Chapter 1, Overview of Big Data and Hive, introduces the evolution of big data, the Hadoop ecosystem, and Hive. You will also learn the Hive architecture and the advantages of using Hive in big data analysis.

Chapter 2, Setting Up the Hive Environment, describes the Hive environment setup and configuration. It also covers using Hive through the command line and development tools.

Chapter 3, Data Definition and Description, introduces the basic data types and data definition language for tables, partitions, buckets, and views in Hive.

Chapter 4, Data Selection and Scope, shows you ways to discover the data by querying, linking, and scoping the data in Hive.

Chapter 5, Data Manipulation, describes the process of exchanging, moving, sorting, and transforming the data in Hive.

Chapter 6, Data Aggregation and Sampling, explains how to do aggregation and sampling using aggregation functions, analytic functions, windowing, and sample clauses.

Chapter 7, Performance Considerations, introduces the best practices of performance considerations in the aspects of design, file format, compression, storage, query, and job.

Chapter 8, Extensibility Considerations, describes how to extend Hive by creating user-defined functions, streaming, serializers, and deserializers.

Chapter 9, Security Considerations, introduces the area of Hive security in terms of authentication, authorization, and encryption.

Chapter 10, Working with Other Tools, discusses how Hive works with other big data tools. It also reviews the key milestones of Hive releases.
What you need for this book

You will need to install both Hadoop and Hive to run the examples in this book. The scripts in this book were written and tested with Cloudera Distributed Hadoop (CDH) v5.3 (contains Hive v0.13.x and Hadoop v2.5.0), Hortonworks Data Platform (HDP) v2.2 (contains Hive v0.14.0 and Hadoop v2.6.0), and Apache Hive 1.0.0 (with Hadoop 1.2.1) in pseudo-distributed mode. However, the majority of the scripts will also run on previous versions of Hadoop and Hive. The following are the other software applications you may need for a better understanding of the Hive-related tools mentioned in the book. These tools are also available in the CDH or HDP packages.

Hue 2.2.0 and above
HBase 0.98.4
Oozie 4.0.0 and above
ZooKeeper 3.4.5
Tez 0.6.0
Who this book is for

If you are a data analyst, developer, or user who wants to use Hive to explore and analyze data in Hadoop, this is the book for you. Whether you are new to big data or an expert, you will be able to master both the basic and the advanced features of Hive. Since Hive is an SQL-like language, some previous experience with the SQL language and databases is useful for a better understanding of this book.
Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Aggregate function can be used with other aggregate functions in the same select statement."

A block of code is set as follows:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

customAuthenticator.java

package com.packtpub.hive.essentials.hiveudf;

import java.util.Hashtable;
import javax.security.sasl.AuthenticationException;
import org.apache.hive.service.auth.PasswdAuthenticationProvider;

Any command-line input or output is written as follows:

bash-4.1$ hdfs dfs -mkdir /tmp

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Click on the OK button and restart Oracle SQL Developer."
Note: Warnings or important notes appear in a box like this.

Tip: Tips and tricks appear like this.
Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.
Chapter 1. Overview of Big Data and Hive

This chapter is an overview of big data and Hive, especially in the Hadoop ecosystem. It briefly introduces the evolution of big data so that readers know where they are in the journey of big data and can find their preferred areas for future learning. This chapter also covers how Hive has become one of the leading tools in big data warehousing and why Hive is still competitive.

In this chapter, we will cover the following topics:

A short history from database and data warehouse to big data
Introducing big data
Relational and NoSQL databases versus Hadoop
Batch, real-time, and stream processing
Hadoop ecosystem overview
Hive overview
A short history

In the 1960s, when computers became a more cost-effective option for businesses, people started to use databases to manage data. Later on, in the 1970s, relational databases became more popular for business needs since they connected physical data to the logical business easily and closely. In the next decade, around the 1980s, Structured Query Language (SQL) became the standard query language for databases. The effectiveness and simplicity of SQL motivated lots of people to use databases and brought databases closer to a wide range of users and developers. Soon, it was observed that people used databases for data application and management, and this continued for a long period of time.

Once plenty of data was collected, people started to think about how to deal with the old data. Then, the term data warehousing came up in the 1990s. From that time onwards, people started to discuss how to evaluate current performance by reviewing historical data. Various data models and tools were created at that time to help enterprises effectively manage, transform, and analyze historical data. Traditional relational databases also evolved to provide more advanced aggregation and analytical functions as well as optimizations for data warehousing. The leading query language was still SQL, but it was more intuitive and powerful as compared to the previous versions. The data was still well structured and the model was normalized. As we entered the 2000s, the Internet gradually became the topmost industry for the creation of the majority of data in terms of variety and volume. Newer technologies, such as social media analytics, web mining, and data visualizations, helped lots of businesses and companies deal with massive amounts of data for a better understanding of their customers, products, competition, as well as markets. The data volume grew and the data format changed faster than ever before, which forced people to search for new solutions, especially from the academic and open source areas. As a result, big data became a hot topic and a challenging field for many researchers and companies.

However, in every challenge there lies great opportunity. Hadoop was one of the open source projects earning wide attention due to its open source license and active communities. This was one of the few times that an open source project led to changes in technology trends before any commercial software products. Soon after, the NoSQL database and real-time and stream computing, as followers, quickly became important components of big data ecosystems. Armed with these big data technologies, companies were able to review the past, evaluate the current, and also predict the future. Around the 2010s, time to market became the key factor for making businesses competitive and successful. When it comes to big data analysis, people could not wait to see the reports or results. A short delay could make a great difference when making important business decisions. Decision makers wanted to see the reports or results immediately, within a few hours, minutes, or even possibly seconds in a few cases. Real-time analytical tools, such as Impala (http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html), Presto (http://prestodb.io/), Storm (https://storm.apache.org/), and so on, make this possible in different ways.
Introducing big data

Big data is not simply a big volume of data. Here, the word "Big" refers to the big scope of data. A well-known saying in this domain is to describe big data with the help of three words starting with the letter V: volume, velocity, and variety. But the analytical and data science world has seen data varying in other dimensions in addition to the fundamental 3Vs of big data, such as veracity, variability, volatility, visualization, and value. The different Vs mentioned so far are explained as follows:

Volume: This refers to the amount of data generated in seconds. 90 percent of the world's data today has been created in the last two years. Since that time, the data in the world doubles every two years. Such big volumes of data are mainly generated by machines, networks, social media, and sensors, including structured, semi-structured, and unstructured data.

Velocity: This refers to the speed at which the data is generated, stored, analyzed, and moved around. With the availability of Internet-connected devices, wireless or wired, machines and sensors can pass on their data as soon as it is created. This leads to real-time streaming and helps businesses make valuable and fast decisions.

Variety: This refers to the different data formats. Data used to be stored as text, dat, and csv from sources such as file systems, spreadsheets, and databases. This type of data, which resides in a fixed field within a record or file, is called structured data. Nowadays, data is not always in the traditional format. The newer semi-structured or unstructured forms of data can be generated using various methods such as e-mails, photos, audio, video, PDFs, SMSes, or even something we have no idea about. These varieties of data formats create problems for storing and analyzing data. This is one of the major challenges we need to overcome in the big data domain.

Veracity: This refers to the quality of data, such as trustworthiness, biases, noise, and abnormality in data. Corrupt data is quite normal. It could originate due to a number of reasons, such as typos, missing or uncommon abbreviations, data reprocessing, system failures, and so on. However, ignoring this malicious data could lead to inaccurate data analysis and eventually a wrong decision. Therefore, making sure the data is correct in terms of data audition and correction is very important for big data analysis.

Variability: This refers to the changing of data. It means that the same data could have different meanings in different contexts. This is particularly important when carrying out sentiment analysis. The analysis algorithms should be able to understand the context and discover the exact meaning and values of data in that context.

Volatility: This refers to how long the data is valid and stored. This is particularly important for real-time analysis. It requires a target scope of data to be determined so that analysts can focus on particular questions and gain good performance out of the analysis.

Visualization: This refers to the way of making data well understood. Visualization does not mean ordinary graphs or pie charts. It makes vast amounts of data comprehensible in a multidimensional view that is easy to understand. Visualization is an innovative way to show changes in data. It requires lots of interaction, conversations, and joint efforts between big data analysts and business domain experts to make the visualization meaningful.

Value: This refers to the knowledge gained from data analysis on big data. The value of big data is how organizations turn themselves into big data-driven companies and use the insight from big data analysis for their decision making.

In summary, big data is not just about lots of data; it is a practice to discover new insight from existing data and guide the analysis of future data. A big-data-driven business will be more agile and competitive to overcome challenges and win competitions.
Relational and NoSQL database versus Hadoop

Let's compare different data solutions with ways of traveling. You will be surprised to find that they have many similarities. When people travel, they either take cars or airplanes depending on the travel distance and cost. For example, when you travel to Vancouver from Toronto, an airplane is always the first choice in terms of the travel time versus cost. When you travel to Niagara Falls from Toronto, a car is always a good choice. When you travel to Montreal from Toronto, some people may prefer taking a car to an airplane. The distance and cost here are like the big data volume and investment. The traditional relational database is like the car in this example, and the Hadoop big data tool is like the airplane. When you deal with a small amount of data (short distance), a relational database (like the car) is always the best choice since it is faster and more agile for a small or moderate size of data. When you deal with a big amount of data (long distance), Hadoop (like the airplane) is the best choice since it scales linearly and is fast and stable when dealing with big volumes of data. On the contrary, you can drive from Toronto to Vancouver, but it takes too much time. You can also take an airplane from Toronto to Niagara, but it could take more time and cost way more than if you travel by car. In addition, you may have a choice to either take a ship or a train. This is like a NoSQL database, which offers characteristics of both a relational database and Hadoop in terms of good performance and support for various data formats for big data.
Batch, real-time, and stream processing

Batch processing is used to process data in batches; it reads data input, processes it, and writes it to the output. Apache Hadoop is the most well-known and popular open source implementation of batch processing and a distributed system using the MapReduce paradigm. The data is stored in a shared and distributed file system called Hadoop Distributed File System (HDFS), divided into splits, which are the logical data divisions for MapReduce processing. To process these splits using the MapReduce paradigm, the map task reads the splits, passes all of its key/value pairs to a map function, and writes the results to intermediate files. After the map phase is completed, the reducer reads the intermediate files and passes them to the reduce function. Finally, the reduce task writes results to the final output files. The advantages of the MapReduce model include making distributed programming easier, near-linear speedup, good scalability, as well as fault tolerance. The disadvantage of this batch processing model is being unable to execute recursive or iterative jobs. In addition, the obvious batch behavior is that all inputs must be ready for map before the reduce job starts, which makes MapReduce unsuitable for online and stream processing use cases.

Real-time processing is to process data and get the result almost immediately. This concept in the area of real-time ad hoc queries over big data was first implemented in Dremel by Google. It uses a novel columnar storage format for nested structures, with fast indexing and scalable aggregation algorithms for computing query results in parallel instead of in batch sequences. These two techniques are the major characteristics of real-time processing and are used by similar implementations, such as Cloudera Impala, Facebook Presto, Apache Drill, and Hive on Tez powered by the Stinger initiative, whose goal is a 100x performance improvement over Apache Hive. On the other hand, in-memory computing no doubt offers other solutions for real-time processing. In-memory computing offers very high bandwidth, more than 10 gigabytes/second, compared to hard disks' 200 megabytes/second. Also, the latency is comparatively lower, nanoseconds versus milliseconds, compared to hard disks. With the price of RAM going lower and lower each day, in-memory computing is more affordable as a real-time solution; Apache Spark is a popular open source implementation of in-memory computing. Spark can be easily integrated with Hadoop, and its resilient distributed dataset can be generated from data sources such as HDFS and HBase for efficient caching.

Stream processing is to continuously process and act on live stream data to get a result. In stream processing, there are two popular frameworks: Storm (https://storm.apache.org/) from Twitter and S4 (http://incubator.apache.org/s4/) from Yahoo!. Both frameworks run on the Java Virtual Machine (JVM) and both process keyed streams. In terms of the programming model, an S4 program is defined as a graph of Processing Elements (PE), small subprograms, and S4 instantiates a PE per key. In short, Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework.
Overview of the Hadoop ecosystem

Hadoop was first released by Apache in 2011 as version 1.0.0. It only contained HDFS and MapReduce. Hadoop was designed as both a computing (MapReduce) and storage (HDFS) platform from the very beginning. With the increasing need for big data analysis, Hadoop attracted lots of other software to resolve big data questions together and merged into a Hadoop-centric big data ecosystem. The following diagram gives a brief introduction to the Hadoop ecosystem and the core software or components in the ecosystem:

The Hadoop ecosystem

In the current Hadoop ecosystem, HDFS is still the major storage option. On top of it, Snappy, RCFile, Parquet, and ORCFile could be used for storage optimization. Core Hadoop MapReduce released version 2.0, called Yarn, for better performance and scalability. Spark and Tez, as solutions for real-time processing, are able to run on Yarn to work with Hadoop closely. HBase is a leading NoSQL database, especially when there is a NoSQL database request on the deployed Hadoop clusters. Sqoop is still one of the leading and mature tools for exchanging data between Hadoop and relational databases. Flume is a mature, distributed, and reliable log-collecting tool to move or collect data to HDFS. Impala and Presto query directly against the data on HDFS for better performance. However, Hortonworks focuses on the Stinger initiative to make Hive 100 times faster. In addition, Hive over Spark and Hive over Tez offer a choice for users to run Hive on computing frameworks other than MapReduce. As a result, Hive is playing a more important role in the ecosystem than ever.
Hive overview

Hive is a standard for SQL queries over petabytes of data in Hadoop. It provides SQL-like access to data in HDFS, enabling Hadoop to be used like a data warehouse. The Hive Query Language (HQL) has similar semantics and functions to standard SQL in relational databases, so that experienced database analysts can easily get their hands on it. Hive's query language can run on different computing frameworks, such as MapReduce, Tez, and Spark, for better performance.

Hive's data model provides a high-level, table-like structure on top of HDFS. It supports three data structures: tables, partitions, and buckets, where tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets. Hive supports a majority of primitive data types such as TIMESTAMP, STRING, FLOAT, BOOLEAN, DECIMAL, DOUBLE, INT, SMALLINT, BIGINT, and complex data types, such as UNION, STRUCT, MAP, and ARRAY.
The following diagram shows the architecture of Hive in the Hadoop ecosystem. The Hive metadata store (also called the metastore) can use either embedded, local, or remote databases. Hive servers are built on Apache Thrift Server technology. Since Hive release 0.11, HiveServer2 is available to handle multiple concurrent clients; it supports Kerberos, LDAP, and custom pluggable authentication, providing better options for JDBC and ODBC clients, especially for metadata access.

Hive architecture
Here are some highlights of Hive that we can keep in mind moving forward:

Hive provides a simpler query model with less coding than MapReduce
HQL and SQL have similar syntax
Hive provides lots of functions that lead to easier analytics usage
The response time is typically much faster than other types of queries on the same type of huge datasets
Hive supports running on different computing frameworks
Hive supports ad hoc querying of data on HDFS
Hive supports user-defined functions, scripts, and a customized I/O format to extend its functionality
Hive is scalable and extensible to various types of data and bigger datasets
Mature JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting
Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats
Hive has a well-defined architecture for metadata management, authentication, and query optimizations
There is a big community of practitioners and developers working on and using Hive
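To illustrate the first of these highlights, an aggregation that would require a full MapReduce program in Java can be expressed in a few lines of HQL. The employee table and its columns below are hypothetical:

```sql
-- Count employees per year: one short query instead of hand-written
-- map and reduce classes compiled and packaged into a jar.
SELECT year, count(*) AS employee_cnt
FROM employee
GROUP BY year;
```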
Summary

After going through this chapter, we are now able to understand why and when to use big data instead of a traditional relational database. We also understand the difference between batch processing, real-time processing, and stream processing. We got familiar with the Hadoop ecosystem, especially Hive. We have also gone back in time and brushed through the history from database and data warehouse to big data, along with some big data terms, the Hadoop ecosystem, the Hive architecture, and the advantages of using Hive. In the next chapter, we will practice setting up Hive and all the tools needed to get started using Hive on the command line.
Chapter 2. Setting Up the Hive Environment

This chapter will introduce how to install and set up the Hive environment in the cluster and cloud. It also covers the usage of basic Hive commands and the Hive-integrated development environment.

In this chapter, we will cover the following topics:

Installing Hive from Apache
Installing Hive from vendor packages
Starting Hive in the cloud
Using the Hive command line and Beeline
The Hive-integrated development environment
Installing Hive from Apache

To introduce the Hive installation, we use Hive version 1.0.0 as an example. The pre-installation requirements for this installation are as follows:

JDK 1.7.0_51
Hadoop 0.20.x, 0.23.x.y, 1.x.y, or 2.x.y
Ubuntu 14.04/CentOS 6.2

Note: Since we focus on Hive in this book, the installation steps for Java and Hadoop are not provided here. For steps on installing them, please refer to https://www.java.com/en/download/help/download_options.xml and http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html.
The following steps describe how to install Hive from Apache through the Linux command line:
1. Download Hive from Apache Hive and unpack it:
bash-4.1$ wget http://apache.mirror.rafal.ca/hive/hive-1.0.0/apache-hive-1.0.0-bin.tar.gz
bash-4.1$ tar -zxvf apache-hive-1.0.0-bin.tar.gz
2. Add Hive to the system path by opening /etc/profile or ~/.bashrc and adding the following two rows:
export HIVE_HOME=/home/hivebooks/apache-hive-1.0.0-bin
export PATH=$PATH:$HIVE_HOME/bin:$HIVE_HOME/conf
3. Enable the settings immediately:
bash-4.1$ source /etc/profile
4. Create the configuration files:
bash-4.1$ cd apache-hive-1.0.0-bin/conf
bash-4.1$ cp hive-default.xml.template hive-site.xml
bash-4.1$ cp hive-env.sh.template hive-env.sh
bash-4.1$ cp hive-exec-log4j.properties.template hive-exec-log4j.properties
bash-4.1$ cp hive-log4j.properties.template hive-log4j.properties
5. Modify the configuration file at $HIVE_HOME/conf/hive-env.sh:
# Set HADOOP_HOME to point to a specific Hadoop install directory
export HADOOP_HOME=/home/hivebooks/hadoop-2.2.0
# Hive Configuration Directory can be accessed at:
export HIVE_CONF_DIR=/home/hivebooks/apache-hive-1.0.0-bin/conf
6. Modify the configuration file at $HIVE_HOME/conf/hive-site.xml. There are some important parameters that need special attention:
hive.metastore.warehouse.dir: This is the path for Hive warehouse storage. By default, it is /user/hive/warehouse.
hive.exec.scratchdir: This is the temporary data file path. By default, it is /tmp/hive-${user.name}.
By default, Hive uses the Derby (http://db.apache.org/derby/) database as the metadata store. Hive can also use other databases, such as PostgreSQL (http://www.postgresql.org/) or MySQL (http://www.mysql.com/), as the metadata store. To configure Hive to use other databases, the following parameters should be configured:
javax.jdo.option.ConnectionURL // the database URL
javax.jdo.option.ConnectionDriverName // the JDBC driver name
javax.jdo.option.ConnectionUserName // the database username
javax.jdo.option.ConnectionPassword // the database password
The following is an example setting using MySQL as the metastore database:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
  <description>password to use against metastore database</description>
</property>
Make sure the MySQL JDBC driver is available at $HIVE_HOME/lib.
Note
The difference between an embedded Derby database and an external database is that an external database offers a shared service, so users can share the Hive metadata. An embedded database, however, is only visible to local users.
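For comparison with the MySQL setting shown above, this is what the default embedded Derby connection looks like in hive-site.xml. This is the stock value shipped in hive-default.xml.template and needs no extra configuration:

```xml
<!-- Default embedded Derby metastore; the metastore_db folder is
     created in the directory from which Hive is started -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
```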
Create folders and grant proper write permissions to the user group in the HDFS folder:
bash-4.1$ hdfs dfs -mkdir /tmp
bash-4.1$ hdfs dfs -mkdir /user/hive/warehouse
bash-4.1$ hdfs dfs -chmod g+w /tmp
bash-4.1$ hdfs dfs -chmod g+w /user/hive/warehouse
That’s all for the Apache Hive installation. On one of the nodes where Hive is installed, type hive to enter the Hive command-line environment (hive>), which verifies that Hive is successfully installed.
Installing Hive from vendor packages
Right now, many companies, such as Cloudera, MapR, IBM, and Hortonworks, have packaged Hadoop into more easily manageable distributions. Each company takes a slightly different strategy, but the consensus for all of these packages is to make Hadoop easier to use for the enterprise. For example, we can easily install Hive from Cloudera Distributed Hadoop (CDH), which can be downloaded from http://www.cloudera.com/content/cloudera/en/downloads/cdh.html.
Once CDH is installed to have the Hadoop environment ready, we can add Hive to the Hadoop cluster by following a few steps:
1. Log in to the Cloudera Manager and click on the dropdown button after the cluster name to choose Add a Service.
(Figure: Cloudera Manager main page)
2. In the first Add Service Wizard page, choose Hive to install.
3. In the second Add Service Wizard page, set the dependencies for the service. Sentry is the authorization policy service for Hive.
4. In the third Add Service Wizard page, choose the proper hosts for HiveServer2, Hive Metastore Server, WebHCat Server, and Gateway.
5. In the fourth Add Service Wizard page, configure the Hive Metastore Server database connections.
6. In the last page of the Add Service Wizard, review the changes to the Hive warehouse directory and metastore server port number. Keep the default values and click on the Continue button to start installing the Hive service. Once it is complete, close the wizard to finish the Hive installation.
Note
Hive can also be installed along with other services when we first install CDH in the cluster, or we can directly import the vendors' quick-start Hadoop virtual machine images.
Starting Hive in the cloud
Right now, Amazon EMR, Cloudera Director, and Microsoft Azure HDInsight Service are some of the major vendors offering matured Hadoop and Hive services in the cloud. Using the cloud version of Hive is very convenient; it requires almost no installation and setup.
Amazon EMR (http://aws.amazon.com/elasticmapreduce/) is the earliest Hadoop service in the cloud. However, it is not a pure open source version of Hadoop, but is customized to run only on the AWS cloud. Cloudera is one of the first few players that offered open source Hadoop solutions to the enterprise. Since the middle of October 2014, Cloudera has delivered Cloudera Director (http://www.cloudera.com/content/cloudera/en/products-and-services/director.html), which opens up Hadoop deployments in the cloud through a simple, self-service interface, and is fully supported on Amazon Web Services. Windows Azure HDInsight Service (http://azure.microsoft.com/en-us/documentation/services/hdinsight/) is a service that deploys and provisions Apache Hadoop clusters in the Azure cloud. Although Hadoop was first built on Linux, Hortonworks and Microsoft have partnered to bring the benefits of Apache Hadoop to the Windows Azure cloud.
The consensus among all the vendors here is to allow the enterprise to provision highly available Hadoop clusters powered with flexibility, security, management, and governance functionalities through a very simple user interface.
Using the Hive command line and Beeline
Hive first started with HiveServer1. However, this version of the Hive server was not very stable. It sometimes suspended or blocked clients' connections quietly. Since version 0.11.0, Hive has included a new Hive server called HiveServer2 as an addition to HiveServer1. HiveServer2 is an enhanced Hive server designed for multiclient concurrency and improved authentication. HiveServer2 also supports Beeline as the alternative command-line interface. HiveServer1 is deprecated and has been removed from Hive since version 1.0.0.
The primary difference between the two Hive servers is how the clients connect to Hive. Hive CLI is an Apache Thrift-based client, and Beeline is a JDBC client based on the SQLLine (http://sqlline.sourceforge.net/) CLI. The Hive CLI directly connects to the Hive drivers and requires installing Hive on the same machine as the client. However, Beeline connects to HiveServer2 through JDBC connections and does not require the installation of Hive libraries on the same machine as the client. That means we can run Beeline remotely from outside of the Hadoop cluster.
The following table lists the commonly used commands for both Beeline and Hive CLI. For more usage of HiveServer2 and Beeline, refer to https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients.
Purpose           | HiveServer2 Beeline                                | HiveServer1 CLI
Server connection | beeline -u <jdbc url> -n <username> -p <password>  | hive -h <hostname> -p <port>
Help              | beeline -h or beeline --help                       | hive -H
Run query         | beeline -e <query in quotes>                       | hive -e <query in quotes>
                  | beeline -f <query file name>                       | hive -f <query file name>
Define variable   | beeline --hivevar key=value                        | hive --hivevar key=value
                  | (available after Hive 0.13.0)                      |
The following is the command-line syntax in Beeline or Hive CLI:
Purpose            | HiveServer2 Beeline            | HiveServer1 CLI
Enter mode         | beeline                        | hive
Connect            | !connect <jdbc url>            | n/a
List tables        | !table                         | show tables;
List columns       | !column <table_name>           | desc <table_name>;
Run query          | <HQL query>;                   | <HQL query>;
Save result set    | !record <file_name>            | n/a
                   | !record (to stop recording)    |
Run shell CMD      | !sh ls                         | !ls;
                   | (available since Hive 0.14.0)  |
Run dfs CMD        | dfs -ls                        | dfs -ls;
Run file of SQL    | !run <file_name>               | source <file_name>;
Check Hive version | !dbinfo                        | !hive --version;
Quit mode          | !quit                          | quit;
Note
For Beeline, ; is not needed after commands that start with !.
When running a query in Hive CLI, MapReduce statistics information is shown on the console screen while processing, whereas Beeline does not show it.
Neither Beeline nor Hive CLI supports running a pasted query with <tab> inside it, because <tab> is used for auto-completion by default in the environment. Alternatively, running the query from files has no such issue.
Hive CLI shows the exact line and position of Hive query or syntax errors when the query has multiple lines. However, Beeline processes a multiple-line query as a single line, so only the position is shown for query or syntax errors, with the line number reported as 1 in all instances. In this respect, Hive CLI is more convenient than Beeline for debugging Hive queries.
In both Hive CLI and Beeline, the up and down arrow keys can retrieve up to 10,000 previous commands. The !history command can be used in Beeline to show all history.
Both Hive CLI and Beeline support variable substitution; refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution.
A list of Hive configuration settings and properties can be accessed and overwritten by the SET keyword from the command-line environment. For more details, refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties.
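As a quick illustration of the SET keyword, the following session-level commands list, check, and override a property. The property hive.exec.parallel is used here only as an example:

```sql
SET;                          -- list every configuration setting and its value
SET hive.exec.parallel;       -- show the current value of one property
SET hive.exec.parallel=true;  -- override the property for this session only
```

Overrides made this way apply only to the current session; edit hive-site.xml to change a property permanently.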
The Hive-integrated development environment
Besides the command-line interface, there are a few integrated development environment (IDE) tools available for Hive development. One of the best is Oracle SQL Developer, which leverages the powerful functionalities of the Oracle IDE and is totally free to use. If we have to use Oracle along with Hive in a project, it is quite convenient to switch between them within the same IDE.
Oracle SQL Developer has supported Hive since version 4.0.3. Configuring it to work with Hive is quite straightforward. The following are a few steps to configure the IDE to connect to Hive:
1. Download the Hive JDBC drivers from the vendor website, such as Cloudera.
2. Unzip the JDBC version 4 driver to a local directory.
3. Start Oracle SQL Developer and navigate to Preferences | Database | Third Party JDBC Drivers.
4. Add all of the JAR files contained in the unzipped directory to the Third-party JDBC Driver Path setting as follows:
(Figure: SQL Developer configuration)
5. Click on the OK button and restart Oracle SQL Developer.
6. Create new connections in the Hive tab, giving a proper Connection Name, Username, Password, Hostname (Hive server hostname), Port, and Database. Then, click on the Add and Connect buttons to connect to Hive.
(Figure: SQL Developer connections)
In Oracle SQL Developer, we can run all Hive interactive commands as well as Hive queries. We can also leverage the power of Oracle SQL Developer to browse and export data in a Hive table from the graphical user interface and wizards.
Besides Hive IDEs, Hive also has its own built-in web interface, the Hive Web Interface. However, it is not powerful and is not used very often. Hue (http://gethue.com/) is another web interface for the Hadoop ecosystem, including Hive. It is a very powerful and user-friendly web user interface. More details about using Hue with Hive are introduced in Chapter 10, Working with Other Tools.
Summary
In this chapter, we introduced the setup of Hive in different environments with proper settings. We also looked into a few of the Hive interactive commands and queries in Hive CLI, Beeline, and IDEs. After going through this chapter, we should be able to set up our own Hive environment locally and use Hive from CLI or IDE tools.
In the next chapter, we will dive into the details of Hive data definition languages.
Chapter 3. Data Definition and Description
This chapter introduces the basic data types, data definition language, and schema in Hive to describe data. It also covers best practices to describe data correctly and effectively by using internal or external tables, partitions, buckets, and views.
In this chapter, we will cover the following topics:
Hive primitive and complex data types
Data type conversions
Hive tables
Hive partitions
Hive buckets
Hive views
Understanding Hive data types
Hive data types are categorized into two types: primitive and complex data types. String and integer are the most useful primitive types, which are supported by most Hive functions.
Tip
Downloading the example code
You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
The details of the primitive types are as follows:
Primitive data type | Description | Example
TINYINT | It has 1 byte, from -128 to 127. The postfix is Y. It is used as a small range of numbers. | 10Y
SMALLINT | It has 2 bytes, from -32,768 to 32,767. The postfix is S. It is used as a regular descriptive number. | 10S
INT | It has 4 bytes, from -2,147,483,648 to 2,147,483,647. | 10
BIGINT | It has 8 bytes, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The postfix is L. | 100L
FLOAT | This is a 4-byte single-precision floating point number, from 1.40129846432481707e-45 to 3.40282346638528860e+38 (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. | 1.2345679
DOUBLE | This is an 8-byte double-precision floating point number, from 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative). Scientific notation is not yet supported. It stores very close approximations of numeric values. | 1.2345678901234567
DECIMAL | This was introduced in Hive 0.11.0 with a hardcoded precision of 38 digits. Hive 0.13.0 introduced user-definable precision and scale. Its range is approximately -(10^38 - 1) to 10^38 - 1. Decimal data types store exact representations of numeric values. The default definition of this type is decimal(10,0). | DECIMAL(3,2) for 3.14
BINARY | This was introduced in Hive 0.8.0 and only supports CAST to STRING and vice versa. | 1011
BOOLEAN | This is a TRUE or FALSE value. | TRUE
STRING | This includes characters expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping within the strings. The max size is around 2 GB. | 'Books' or "Books"
CHAR | This is available starting with Hive 0.13.0. Most UDFs will work for this type after Hive 0.14.0. The maximum length is fixed at 255. | 'US' or "US"
VARCHAR | This is available starting with Hive 0.12.0. Most UDFs will work for this type after Hive 0.14.0. The maximum length is fixed at 65,535. If a string value being converted/assigned to a varchar value exceeds the length specified, the string is silently truncated. | 'Books' or "Books"
DATE | This describes a specific year, month, and day in the format YYYY-MM-DD. It is available since Hive 0.12.0. The range of dates is from 0000-01-01 to 9999-12-31. | '2013-01-01'
TIMESTAMP | This describes a specific year, month, day, hours, minutes, seconds, and milliseconds in the format YYYY-MM-DD HH:MM:SS[.fff...]. It is available since Hive 0.8.0. | '2013-01-01 12:00:01.345'
Hive has three main complex types: ARRAY, MAP, and STRUCT. These data types are built on top of the primitive data types. ARRAY and MAP are similar to those in Java. STRUCT is a record type, which may contain a set of fields of any type. Complex types allow the nesting of types. The details of complex types are as follows:
Complex data type | Description | Example
ARRAY | This is a list of items of the same type, such as (val1, val2, and so on). You can access a value using array_name[index], for example, fruit[0]='apple'. | ['apple', 'orange', 'mango']
MAP | This is a set of key-value pairs, such as (key1, val1, key2, val2, and so on). You can access a value using map_name[key], for example, fruit[1]="apple". | {1: "apple", 2: "orange"}
STRUCT | This is a user-defined structure of fields of any type, such as {val1, val2, val3, and so on}. By default, STRUCT field names will be col1, col2, and so on. You can access a value using struct_name.column_name, for example, fruit.col1=1. | {1, "apple"}
NAMED STRUCT | This is a user-defined structure of any number of typed fields, such as (name1, val1, name2, val2, and so on). You can access a value using struct_name.column_name, for example, fruit.apple="gala". | {"apple": "gala", "weightkg": 1}
UNION | This is a structure that has exactly one of the specified data types. It is available since Hive 0.7.0. It is not commonly used. | {2: ["apple", "orange"]}
Note
For MAP, the types of the keys and values must be unified. However, STRUCT is more flexible. STRUCT is more like a table, whereas MAP is more like an ARRAY with a customized index.
The following is a short practice covering all the commonly used Hive types. The details of the CREATE, LOAD, and SELECT statements will be described later. Let's take a look at the process:
1. Prepare the data as follows:
-bash-4.1$ vi employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
2. Log in to Beeline with the proper HiveServer2 hostname, port number, database name, username, and password:
-bash-4.1$ beeline
beeline> !connect jdbc:hive2://localhost:10000/default
scan complete in 20ms
Connecting to jdbc:hive2://localhost:10000/default
Enter username for jdbc:hive2://localhost:10000/default: dayongd
Enter password for jdbc:hive2://localhost:10000/default:
3. Create a table using the ARRAY, MAP, and STRUCT composite data types:
jdbc:hive2://> CREATE TABLE employee
.......> (
.......>   name string,
.......>   work_place ARRAY<string>,
.......>   sex_age STRUCT<sex:string,age:int>,
.......>   skills_score MAP<string,int>,
.......>   depart_title MAP<string,ARRAY<string>>
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.149 seconds)
4. Verify the table's creation:
jdbc:hive2://> !table employee
+------------+---------------+-------------+----------------+----------+
| TABLE_CAT  | TABLE_SCHEMA  | TABLE_NAME  | TABLE_TYPE     | REMARKS  |
+------------+---------------+-------------+----------------+----------+
|            | default       | employee    | MANAGED_TABLE  |          |
+------------+---------------+-------------+----------------+----------+
jdbc:hive2://> !column employee
+--------------+-------------+---------------+-----------------------------+
| TABLE_SCHEM  | TABLE_NAME  | COLUMN_NAME   | TYPE_NAME                   |
+--------------+-------------+---------------+-----------------------------+
| default      | employee    | name          | STRING                      |
| default      | employee    | work_place    | array<string>               |
| default      | employee    | sex_age       | struct<sex:string,age:int>  |
| default      | employee    | skills_score  | map<string,int>             |
| default      | employee    | depart_title  | map<string,array<string>>   |
+--------------+-------------+---------------+-----------------------------+
5. Load data into the table:
jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (1.023 seconds)
6. Query all the rows in the table:
jdbc:hive2://> SELECT * FROM employee;
+---------+---------------------+--------------+-------------------+-------------------------------+
| name    | work_place          | sex_age      | skills_score      | depart_title                  |
+---------+---------------------+--------------+-------------------+-------------------------------+
| Michael | [Montreal,Toronto]  | [Male,30]    | {DB=80}           | {Product=[Developer,Lead]}    |
| Will    | [Montreal]          | [Male,35]    | {Perl=85}         | {Test=[Lead],Product=[Lead]}  |
| Shelley | [New York]          | [Female,27]  | {Python=80}       | {Test=[Lead],COE=[Architect]} |
| Lucy    | [Vancouver]         | [Female,57]  | {Sales=89,HR=94}  | {Sales=[Lead]}                |
+---------+---------------------+--------------+-------------------+-------------------------------+
4 rows selected (0.677 seconds)
7. Query the whole array and each array column in the table:
jdbc:hive2://> SELECT work_place FROM employee;
+----------------------+
| work_place           |
+----------------------+
| [Montreal,Toronto]   |
| [Montreal]           |
| [New York]           |
| [Vancouver]          |
+----------------------+
4 rows selected (27.231 seconds)
jdbc:hive2://> SELECT work_place[0] AS col_1,
.......> work_place[1] AS col_2, work_place[2] AS col_3
.......> FROM employee;
+------------+----------+--------+
| col_1      | col_2    | col_3  |
+------------+----------+--------+
| Montreal   | Toronto  |        |
| Montreal   |          |        |
| New York   |          |        |
| Vancouver  |          |        |
+------------+----------+--------+
4 rows selected (24.689 seconds)
8. Query the whole struct and each struct column in the table:
jdbc:hive2://> SELECT sex_age FROM employee;
+---------------+
| sex_age       |
+---------------+
| [Male,30]     |
| [Male,35]     |
| [Female,27]   |
| [Female,57]   |
+---------------+
4 rows selected (28.91 seconds)
jdbc:hive2://> SELECT sex_age.sex, sex_age.age FROM employee;
+---------+------+
| sex     | age  |
+---------+------+
| Male    | 30   |
| Male    | 35   |
| Female  | 27   |
| Female  | 57   |
+---------+------+
4 rows selected (26.663 seconds)
9. Query the whole map and each map column in the table:
jdbc:hive2://> SELECT skills_score FROM employee;
+--------------------+
| skills_score       |
+--------------------+
| {DB=80}            |
| {Perl=85}          |
| {Python=80}        |
| {Sales=89,HR=94}   |
+--------------------+
4 rows selected (32.659 seconds)
jdbc:hive2://> SELECT name, skills_score['DB'] AS DB,
.......> skills_score['Perl'] AS Perl,
.......> skills_score['Python'] AS Python,
.......> skills_score['Sales'] AS Sales,
.......> skills_score['HR'] AS HR
.......> FROM employee;
+----------+-----+-------+---------+--------+-----+
| name     | db  | perl  | python  | sales  | hr  |
+----------+-----+-------+---------+--------+-----+
| Michael  | 80  |       |         |        |     |
| Will     |     | 85    |         |        |     |
| Shelley  |     |       | 80      |        |     |
| Lucy     |     |       |         | 89     | 94  |
+----------+-----+-------+---------+--------+-----+
4 rows selected (24.669 seconds)
Note
Note that the column names shown in the result set for Hive are always in lowercase letters.
10. Query the composite type in the table:
jdbc:hive2://> SELECT depart_title FROM employee;
+---------------------------------+
| depart_title                    |
+---------------------------------+
| {Product=[Developer,Lead]}      |
| {Test=[Lead],Product=[Lead]}    |
| {Test=[Lead],COE=[Architect]}   |
| {Sales=[Lead]}                  |
+---------------------------------+
4 rows selected (30.583 seconds)
jdbc:hive2://> SELECT name,
.......> depart_title['Product'] AS Product,
.......> depart_title['Test'] AS Test,
.......> depart_title['COE'] AS COE,
.......> depart_title['Sales'] AS Sales
.......> FROM employee;
+----------+--------------------+---------+--------------+---------+
| name     | product            | test    | coe          | sales   |
+----------+--------------------+---------+--------------+---------+
| Michael  | [Developer,Lead]   |         |              |         |
| Will     | [Lead]             | [Lead]  |              |         |
| Shelley  |                    | [Lead]  | [Architect]  |         |
| Lucy     |                    |         |              | [Lead]  |
+----------+--------------------+---------+--------------+---------+
4 rows selected (26.641 seconds)
jdbc:hive2://> SELECT name,
.......> depart_title['Product'][0] AS product_col0,
.......> depart_title['Test'][0] AS test_col0
.......> FROM employee;
+----------+---------------+------------+
| name     | product_col0  | test_col0  |
+----------+---------------+------------+
| Michael  | Developer     |            |
| Will     | Lead          | Lead       |
| Shelley  |               | Lead       |
| Lucy     |               |            |
+----------+---------------+------------+
4 rows selected (26.659 seconds)
Note
The default delimiters in Hive are as follows:
Field delimiter: This can be used with Ctrl + A or ^A (use \001 when creating the table)
Collection item delimiter: This can be used with Ctrl + B or ^B (\002)
Map key delimiter: This can be used with Ctrl + C or ^C (\003)
If a delimiter is overridden during the table creation, it only works when used in the flat structure. This is still a limitation in Hive, described in Apache JIRA HIVE-365 (https://issues.apache.org/jira/browse/HIVE-365).
For nested types, for example the depart_title column in the preceding tables, the level of nesting determines the delimiter. Using an ARRAY of ARRAY as an example, the delimiters for the outer ARRAY are Ctrl + B (\002) characters, as expected, but for the inner ARRAY they are Ctrl + C (\003) characters, the next delimiter in the list. For our example of using a MAP of ARRAY, the MAP key delimiter is \003, and the ARRAY delimiter is Ctrl + D or ^D (\004).
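To make the defaults explicit, a flat table can spell out the same delimiters with their octal codes in the DDL. This is a sketch for illustration; the table name employee_default is hypothetical, and for a flat schema this declaration is equivalent to omitting the delimiter clauses entirely:

```sql
-- Declaring the default delimiters explicitly (flat structure only)
CREATE TABLE employee_default (
  name string,
  work_place ARRAY<string>,
  skills_score MAP<string,int>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'            -- Ctrl + A
COLLECTION ITEMS TERMINATED BY '\002'  -- Ctrl + B
MAP KEYS TERMINATED BY '\003'          -- Ctrl + C
LINES TERMINATED BY '\n';
```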
Data type conversions
Similar to Java, Hive supports both implicit and explicit type conversion.
Primitive type conversion from a narrow to a wider type is known as implicit conversion. However, the reverse conversion is not allowed. All the integral numeric types, FLOAT, and STRING can be implicitly converted to DOUBLE, and TINYINT, SMALLINT, and INT can all be converted to FLOAT. BOOLEAN types cannot be converted to any other type. In the Apache Hive wiki, there is a data type cross table describing the allowed implicit conversions between every two types in Hive; it can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types.
Explicit type conversion uses the CAST function with the CAST(value AS TYPE) syntax. For example, CAST('100' AS INT) will convert the string '100' to the integer value 100. If the cast fails, such as in CAST('INT' AS INT), the function returns NULL. In addition, the BINARY type can only be cast to STRING; the result can then be cast from STRING to other types, if needed.
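The three cases just described (a successful cast, a failed cast returning NULL, and the two-step BINARY route) can be sketched as follows. The column bin_col and table some_table in the last statement are hypothetical:

```sql
SELECT CAST('100' AS INT);    -- returns 100: a valid numeric string
SELECT CAST('INT' AS INT);    -- returns NULL: the cast fails
-- BINARY cannot be cast to INT directly; go through STRING first
SELECT CAST(CAST(bin_col AS STRING) AS INT) FROM some_table;
```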
Hive Data Definition Language
Hive Data Definition Language (DDL) is a subset of Hive SQL statements that describe the data structure in Hive by creating, deleting, or altering schema objects such as databases, tables, views, partitions, and buckets. Most Hive DDL statements start with the keywords CREATE, DROP, or ALTER. The syntax of Hive DDL is very similar to the DDL in SQL. Comments in Hive start with --.
Hive database
The database in Hive describes a collection of tables that are used for a similar purpose or belong to the same groups. If the database is not specified, the default database is used. Whenever a new database is created, Hive creates a directory for it under the path defined in hive.metastore.warehouse.dir, which is /user/hive/warehouse by default. For example, the myhivebook database is located at /user/hive/warehouse/myhivebook.db. However, the default database doesn't have its own directory. The following is the core DDL for Hive databases:
Create the database without checking whether the database already exists:
jdbc:hive2://> CREATE DATABASE myhivebook;
Create the database and check whether the database already exists:
jdbc:hive2://> CREATE DATABASE IF NOT EXISTS myhivebook;
Create the database with location, comments, and metadata information:
jdbc:hive2://> CREATE DATABASE IF NOT EXISTS myhivebook
.......> COMMENT 'hive database demo'
.......> LOCATION '/hdfs/directory'
.......> WITH DBPROPERTIES ('creator'='dayongd','date'='2015-01-01');
Show and describe the database with wildcards:
jdbc:hive2://> SHOW DATABASES;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (1.7 seconds)
jdbc:hive2://> SHOW DATABASES LIKE 'my.*';
jdbc:hive2://> DESCRIBE DATABASE default;
+----------+------------------------+--------------------------------------------+
| db_name  | comment                | location                                   |
+----------+------------------------+--------------------------------------------+
| default  | Default Hive database  | hdfs://localhost:8020/user/hive/warehouse  |
+----------+------------------------+--------------------------------------------+
1 row selected (1.352 seconds)
Use the database:
jdbc:hive2://> USE myhivebook;
Drop the empty database:
jdbc:hive2://> DROP DATABASE IF EXISTS myhivebook;
Note
Note that Hive keeps the database and its tables as directories. In order to remove the parent directory, we need to remove the subdirectories first. By default, the database cannot be dropped if it is not empty, unless CASCADE is specified. CASCADE drops the tables in the database automatically before dropping the database.
Drop the database with CASCADE:
jdbc:hive2://> DROP DATABASE IF EXISTS myhivebook CASCADE;
Alter the database properties. The ALTER DATABASE statement can only apply to the database properties and the owner (user or role, Hive 0.13.0 and later) of the database. The other metadata about the database cannot be changed:
jdbc:hive2://> ALTER DATABASE myhivebook
.......> SET DBPROPERTIES ('edited-by'='Dayong');
jdbc:hive2://> ALTER DATABASE myhivebook
.......> SET OWNER user dayongd;
Note
SHOW and DESCRIBE
The SHOW and DESCRIBE keywords in Hive are used to show the definition information for most of the Hive objects, such as tables, partitions, and so on.
The SHOW statement supports a wide range of Hive objects, such as tables, table properties, table DDL, indexes, partitions, columns, functions, locks, roles, configurations, transactions, and compactions.
The DESCRIBE statement supports a smaller range of Hive objects, such as databases, tables, views, columns, and partitions. However, the DESCRIBE statement is able to provide more detailed information when combined with the EXTENDED or FORMATTED keywords.
In this book, there is no single section introducing SHOW and DESCRIBE; instead, we introduce their usage in line with other HQL throughout the remaining chapters.
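As a brief sketch of these keywords, the following statements reuse the employee table and myhivebook database from this chapter:

```sql
SHOW TABLES;                            -- list tables in the current database
SHOW CREATE TABLE employee;             -- print the DDL that recreates the table
DESCRIBE DATABASE EXTENDED myhivebook;  -- adds DBPROPERTIES to the output
DESCRIBE FORMATTED employee;            -- detailed, formatted table metadata
```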
Hive internal and external tables
The concept of a table in Hive is very similar to the table in a relational database. Each table associates with a directory, configured in ${HIVE_HOME}/conf/hive-site.xml, in HDFS. By default, it is /user/hive/warehouse in HDFS. For example, /user/hive/warehouse/employee is created by Hive in HDFS for the employee table. All the data in the table will be kept in the directory. Tables of this kind are also referred to as internal or managed tables.
When there is data already in HDFS, an external Hive table can be created to describe the data. It is called EXTERNAL because the data in the external table is specified by the LOCATION property instead of the default warehouse directory. When keeping data in internal tables, Hive fully manages the life cycle of the table and data. This means the data is removed once the internal table is dropped. If an external table is dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is preferred to avoid deleting data along with tables by mistake. The following are DDLs for Hive internal and external table examples:
Show the data file's location and content for the employee internal table:
bash-4.1$ vi /home/hadoop/employee.txt
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Create the internal table and load the data:
jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_internal
.......> (
.......>   name string,
.......>   work_place ARRAY<string>,
.......>   sex_age STRUCT<sex:string,age:int>,
.......>   skills_score MAP<string,int>,
.......>   depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> COMMENT 'This is an internal table'
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':'
.......> STORED AS TEXTFILE;
No rows affected (0.149 seconds)
jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee_internal;
Create the external table and load the data:
jdbc:hive2://> CREATE EXTERNAL TABLE employee_external
.......> (
.......>   name string,
.......>   work_place ARRAY<string>,
.......>   sex_age STRUCT<sex:string,age:int>,
.......>   skills_score MAP<string,int>,
.......>   depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> COMMENT 'This is an external table'
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':'
.......> STORED AS TEXTFILE
.......> LOCATION '/user/dayongd/employee';
No rows affected (1.332 seconds)
jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
.......> OVERWRITE INTO TABLE employee_external;
Note
CREATE TABLE
Hive tables do not yet have constraints like those in a relational database.
If the folder in the path given by the LOCATION property does not exist, Hive will create that folder. If there is another folder inside the folder specified by the LOCATION property, Hive will NOT report errors when creating the table, but will report an error when querying the table.
A temporary table, which is automatically deleted at the end of the Hive session, is supported since Hive 0.14.0 by HIVE-7090 (https://issues.apache.org/jira/browse/HIVE-7090) through the CREATE TEMPORARY TABLE statement.
The STORED AS property is set to TEXTFILE by default. Other file format values, such as SEQUENCEFILE, RCFILE, ORC, AVRO (since Hive 0.14.0), and PARQUET (since Hive 0.13.0), can also be specified.
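For instance, a similar schema could be kept in a columnar format simply by changing the STORED AS clause. This is a sketch; the table name employee_orc is hypothetical, and delimiter clauses are unnecessary for a binary format such as ORC:

```sql
CREATE TABLE employee_orc (
  name string,
  work_place ARRAY<string>,
  sex_age STRUCT<sex:string,age:int>
)
STORED AS ORC;  -- columnar, compressed storage instead of the default TEXTFILE
```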
Create the table as select (CTAS):
jdbc:hive2://> CREATE TABLE ctas_employee
.......> AS SELECT * FROM employee_external;
No rows affected (1.562 seconds)
Note
CTAS
CTAS copies the data as well as the table definition. The table created by CTAS is atomic; this means that other users do not see the table until all the query results are populated. CTAS has the following restrictions:
The table created cannot be a partitioned table
The table created cannot be an external table
The table created cannot be a list-bucketing table
A CTAS statement will trigger a map job to populate the data, even though the SELECT * statement itself does not trigger any MapReduce job.
A CTAS with a Common Table Expression (CTE) can be created as follows:
jdbc:hive2://> CREATE TABLE cte_employee AS
.......> WITH r1 AS
.......> (SELECT name FROM r2
.......> WHERE name = 'Michael'),
.......> r2 AS
.......> (SELECT name FROM employee
.......> WHERE sex_age.sex = 'Male'),
.......> r3 AS
.......> (SELECT name FROM employee
.......> WHERE sex_age.sex = 'Female')
.......> SELECT * FROM r1 UNION ALL SELECT * FROM r3;
No rows affected (61.852 seconds)
jdbc:hive2://> SELECT * FROM cte_employee;
+--------------------+
| cte_employee.name  |
+--------------------+
| Michael            |
| Shelley            |
| Lucy               |
+--------------------+
3 rows selected (0.091 seconds)
Note
CTE
CTE is available since Hive 0.13.0. It is a temporary result set derived from a simple SELECT query specified in a WITH clause, followed by a SELECT or INSERT keyword to operate on this result set. A CTE is defined only within the execution scope of a single statement. One or more CTEs can be used in a nested or chained way with Hive keywords, such as the SELECT, INSERT, CREATE TABLE AS SELECT, or CREATE VIEW AS SELECT statements.
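A minimal sketch of a CTE feeding an INSERT, one of the usages listed above. It assumes a pre-existing target table named male_employee with a single name column:

```sql
WITH males AS (
  SELECT name FROM employee WHERE sex_age.sex = 'Male'
)
INSERT OVERWRITE TABLE male_employee  -- hypothetical target table
SELECT name FROM males;
```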
Empty tables can be created in two ways, as follows:
1. Use CTAS, as shown here:
jdbc:hive2://> CREATE TABLE empty_ctas_employee AS
.......> SELECT * FROM employee_internal WHERE 1=2;
No rows affected (213.356 seconds)
2. Use LIKE, as shown here:
jdbc:hive2://> CREATE TABLE empty_like_employee
.......> LIKE employee_internal;
No rows affected (0.115 seconds)
Check the row counts for both tables:

jdbc:hive2://> SELECT COUNT(*) AS row_cnt
.......> FROM empty_ctas_employee;
+----------+
| row_cnt  |
+----------+
| 0        |
+----------+
1 row selected (51.228 seconds)

jdbc:hive2://> SELECT COUNT(*) AS row_cnt
.......> FROM empty_like_employee;
+----------+
| row_cnt  |
+----------+
| 0        |
+----------+
1 row selected (41.628 seconds)

Note: The LIKE way, which is faster, does not trigger a MapReduce job since it is metadata duplication only.
The DROP TABLE statement removes the metadata completely and moves the data to .Trash in the current user's directory in HDFS, if Trash is configured:

jdbc:hive2://> DROP TABLE IF EXISTS empty_ctas_employee;
No rows affected (0.283 seconds)

jdbc:hive2://> DROP TABLE IF EXISTS empty_like_employee;
No rows affected (0.202 seconds)
The TRUNCATE TABLE statement removes all the rows from a table, which should be an internal table:

jdbc:hive2://> SELECT * FROM cte_employee;
+---------------------+
| cte_employee.name   |
+---------------------+
| Michael             |
| Shelley             |
| Lucy                |
+---------------------+
3 rows selected (0.158 seconds)

jdbc:hive2://> TRUNCATE TABLE cte_employee;
No rows affected (0.093 seconds)

-- Table is empty after truncate
jdbc:hive2://> SELECT * FROM cte_employee;
+---------------------+
| cte_employee.name   |
+---------------------+
+---------------------+
No rows selected (0.059 seconds)
Alter the table's name with the RENAME TO statement:

jdbc:hive2://> !table
+--------------+--------------------+-------------+-----------------------------+
| TABLE_SCHEM  | TABLE_NAME         | TABLE_TYPE  | REMARKS                     |
+--------------+--------------------+-------------+-----------------------------+
| default      | employee           | TABLE       | NULL                        |
| default      | employee_internal  | TABLE       | This is an internal table   |
| default      | employee_external  | TABLE       | This is an external table   |
| default      | ctas_employee      | TABLE       | NULL                        |
| default      | cte_employee       | TABLE       | NULL                        |
+--------------+--------------------+-------------+-----------------------------+

jdbc:hive2://> ALTER TABLE cte_employee RENAME TO c_employee;
No rows affected (0.237 seconds)
Alter the table's properties, such as comments:

jdbc:hive2://> ALTER TABLE c_employee
.......> SET TBLPROPERTIES ('comment' = 'New name, comments');
No rows affected (0.239 seconds)

jdbc:hive2://> !table
+--------------+--------------------+-------------+-----------------------------+
| TABLE_SCHEM  | TABLE_NAME         | TABLE_TYPE  | REMARKS                     |
+--------------+--------------------+-------------+-----------------------------+
| default      | employee           | TABLE       | NULL                        |
| default      | employee_internal  | TABLE       | This is an internal table   |
| default      | employee_external  | TABLE       | This is an external table   |
| default      | ctas_employee      | TABLE       | NULL                        |
| default      | c_employee         | TABLE       | New name, comments          |
+--------------+--------------------+-------------+-----------------------------+

Alter the table's delimiter through SERDEPROPERTIES:

jdbc:hive2://> ALTER TABLE employee_internal SET
.......> SERDEPROPERTIES ('field.delim' = '$');
No rows affected (0.148 seconds)
Alter the table's file format:

jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT RCFILE;
No rows affected (0.235 seconds)

Alter the table's location, which must be a full URI of HDFS:

jdbc:hive2://> ALTER TABLE c_employee
.......> SET LOCATION
.......> 'hdfs://localhost:8020/user/dayongd/employee';
No rows affected (0.169 seconds)
Alter the table's protection by enabling or disabling NO_DROP, which prevents a table from being dropped, or OFFLINE, which prevents data (not metadata) in a table from being queried:

jdbc:hive2://> ALTER TABLE c_employee ENABLE NO_DROP;
jdbc:hive2://> ALTER TABLE c_employee DISABLE NO_DROP;
jdbc:hive2://> ALTER TABLE c_employee ENABLE OFFLINE;
jdbc:hive2://> ALTER TABLE c_employee DISABLE OFFLINE;
Alter the table's concatenation to merge small files into larger files:

-- Convert to the file format supported
jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT ORC;
No rows affected (0.160 seconds)

-- Concatenate files
jdbc:hive2://> ALTER TABLE c_employee CONCATENATE;
No rows affected (0.165 seconds)

-- Convert back to the regular file format
jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT TEXTFILE;
No rows affected (0.143 seconds)
Note CONCATENATE

In Hive release 0.8.0, RCFile added support for fast block-level merging of small RCFiles using the CONCATENATE command. In Hive release 0.14.0, ORC files added support for fast stripe-level merging of small ORC files using the CONCATENATE command. Other file formats are not supported yet. In the case of RCFiles, the merge happens at block level, and ORC files merge at stripe level, thereby avoiding the overhead of decompressing and decoding the data. A MapReduce job is triggered when performing concatenation.
Alter the column's name, data type, and position:

-- Check the column type before changes
jdbc:hive2://> DESC employee_internal;
+----------------+-----------------------------+----------+
| col_name       | data_type                   | comment  |
+----------------+-----------------------------+----------+
| employee_name  | string                      |          |
| work_place     | array<string>               |          |
| sex_age        | struct<sex:string,age:int>  |          |
| skills_score   | map<string,int>             |          |
| depart_title   | map<string,array<string>>   |          |
+----------------+-----------------------------+----------+
5 rows selected (0.119 seconds)

-- Change column type and order
jdbc:hive2://> ALTER TABLE employee_internal
.......> CHANGE name employee_name string AFTER sex_age;
No rows affected (0.23 seconds)

-- Verify the changes
jdbc:hive2://> DESC employee_internal;
+----------------+-----------------------------+----------+
| col_name       | data_type                   | comment  |
+----------------+-----------------------------+----------+
| work_place     | array<string>               |          |
| sex_age        | struct<sex:string,age:int>  |          |
| employee_name  | string                      |          |
| skills_score   | map<string,int>             |          |
| depart_title   | map<string,array<string>>   |          |
+----------------+-----------------------------+----------+
5 rows selected (0.214 seconds)
Alter the column's type and order:

jdbc:hive2://> ALTER TABLE employee_internal
.......> CHANGE employee_name name string FIRST;
No rows affected (0.238 seconds)

-- Verify the changes
jdbc:hive2://> DESC employee_internal;
+---------------+-----------------------------+----------+
| col_name      | data_type                   | comment  |
+---------------+-----------------------------+----------+
| name          | string                      |          |
| work_place    | array<string>               |          |
| sex_age       | struct<sex:string,age:int>  |          |
| skills_score  | map<string,int>             |          |
| depart_title  | map<string,array<string>>   |          |
+---------------+-----------------------------+----------+
5 rows selected (0.119 seconds)
Add/replace columns:

-- Add columns to the table
jdbc:hive2://> ALTER TABLE c_employee ADD COLUMNS (work string);
No rows affected (0.184 seconds)

-- Verify the added columns
jdbc:hive2://> DESC c_employee;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| name      | string     |          |
| work      | string     |          |
+-----------+------------+----------+
2 rows selected (0.115 seconds)

-- Replace all columns
jdbc:hive2://> ALTER TABLE c_employee
.......> REPLACE COLUMNS (name string);
No rows affected (0.132 seconds)

-- Verify that all columns were replaced
jdbc:hive2://> DESC c_employee;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| name      | string     |          |
+-----------+------------+----------+
1 row selected (0.129 seconds)

Note: The ALTER command will only modify Hive's metadata, NOT the data. Users should make sure the actual data conforms with the metadata definition manually.
Hive partitions

By default, a simple query in Hive scans the whole Hive table. This slows down the performance when querying a large table. The issue can be resolved by creating Hive partitions, which are very similar to those in an RDBMS. In Hive, each partition corresponds to a predefined partition column (or columns) and is stored as a subdirectory in the table's directory in HDFS. When the table gets queried, only the required partitions (directories) of data in the table are read, so the I/O and time taken by the query are greatly reduced. It is very easy to implement Hive partitions when the table is created and to check the partitions created, as follows:

-- Create partitions when creating tables
jdbc:hive2://> CREATE TABLE employee_partitioned
.......> (
.......> name string,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<STRING,ARRAY<STRING>>
.......> )
.......> PARTITIONED BY (Year INT, Month INT)
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.293 seconds)

-- Show partitions
jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+------------+
| partition  |
+------------+
+------------+
No rows selected (0.177 seconds)
From the preceding result, we can see that partitions are not created automatically. We have to use ALTER TABLE ADD PARTITION to add partitions to a table. The ADD PARTITION command changes the table's metadata, but does not load data. If the data does not exist in the partition's location, queries will not return any results. To drop a partition, including both data and metadata, use the ALTER TABLE DROP PARTITION statement, as follows:

-- Add multiple partitions
jdbc:hive2://> ALTER TABLE employee_partitioned ADD
.......> PARTITION (year=2014, month=11)
.......> PARTITION (year=2014, month=12);
No rows affected (0.248 seconds)

jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition           |
+---------------------+
| year=2014/month=11  |
| year=2014/month=12  |
+---------------------+
2 rows selected (0.108 seconds)

-- Drop the partition
jdbc:hive2://> ALTER TABLE employee_partitioned
.......> DROP IF EXISTS PARTITION (year=2014, month=11);

jdbc:hive2://> SHOW PARTITIONS employee_partitioned;
+---------------------+
| partition           |
+---------------------+
| year=2014/month=12  |
+---------------------+
1 row selected (0.107 seconds)
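Since each partition is just a `<column>=<value>` subdirectory in HDFS, partition pruning is essentially directory filtering. The following plain-Python sketch (not Hive code; the warehouse paths are illustrative assumptions) models how a query that filters on partition columns ends up reading only the matching directories:

```python
# A minimal sketch of partition pruning: each partition is a
# <column>=<value> subdirectory, and a query filtering on partition
# columns only needs to scan the directories whose values match.
# All paths below are hypothetical.

partition_dirs = [
    "/user/hive/warehouse/employee_partitioned/year=2014/month=11",
    "/user/hive/warehouse/employee_partitioned/year=2014/month=12",
    "/user/hive/warehouse/employee_partitioned/year=2015/month=1",
]

def parse_partition(path):
    """Extract partition column values from the directory path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=")
            parts[key] = int(value)
    return parts

def prune(dirs, **predicates):
    """Keep only directories whose partition values match all predicates."""
    return [d for d in dirs
            if all(parse_partition(d).get(k) == v
                   for k, v in predicates.items())]

# WHERE year = 2014 AND month = 12 scans only one directory
print(prune(partition_dirs, year=2014, month=12))
```

This is why a WHERE clause on partition columns is so much cheaper than one on regular columns: the latter still has to scan every file.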
To avoid manually adding partitions, dynamic partition insert (or multipartition insert) is designed to dynamically determine which partitions should be created and populated while scanning the input table. This part is introduced in more detail in Chapter 5, Data Manipulation.

To load or overwrite data in a partition, we can use the LOAD or INSERT OVERWRITE statements. The statement only overwrites the data in the specified partitions. Although partition columns are subdirectory names, we can query or specify them in the SELECT or WHERE statements to narrow down the result set. The following steps show how to load data into the partitioned table:

Load data into the partition:

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee.txt'
.......> OVERWRITE INTO TABLE employee_partitioned
.......> PARTITION (year=2014, month=12);
No rows affected (0.96 seconds)

Verify the data that is loaded:

jdbc:hive2://> SELECT name, year, month FROM employee_partitioned;
+----------+-------+--------+
| name     | year  | month  |
+----------+-------+--------+
| Michael  | 2014  | 12     |
| Will     | 2014  | 12     |
| Shelley  | 2014  | 12     |
| Lucy     | 2014  | 12     |
+----------+-------+--------+
4 rows selected (37.451 seconds)
The ALTER TABLE/PARTITION statements for file format, location, protections, and concatenation have the same syntax as the ALTER TABLE statements and are shown here:

ALTER TABLE table_name PARTITION partition_spec SET FILEFORMAT file_format;
ALTER TABLE table_name PARTITION partition_spec SET LOCATION 'full URI';
ALTER TABLE table_name PARTITION partition_spec ENABLE NO_DROP;
ALTER TABLE table_name PARTITION partition_spec ENABLE OFFLINE;
ALTER TABLE table_name PARTITION partition_spec DISABLE NO_DROP;
ALTER TABLE table_name PARTITION partition_spec DISABLE OFFLINE;
ALTER TABLE table_name PARTITION partition_spec CONCATENATE;
Hive buckets

Besides partition, the bucket is another technique to cluster datasets into more manageable parts to optimize query performance. Different from a partition, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partition. If there is a further request to use employee_id as the third level of partition, it leads to many deep and small partitions and directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column will be hashed by a user-defined number into buckets. The records with the same employee_id will always be stored in the same bucket (segment of files). By using buckets, Hive can easily and efficiently do sampling (see Chapter 6, Data Aggregation and Sampling) and map-side joins (see Chapter 4, Data Selection and Scope). An example to create a bucket table is as follows:
-- Prepare another dataset and table for the bucket table
jdbc:hive2://> CREATE TABLE employee_id
.......> (
.......> name string,
.......> employee_id int,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<string,ARRAY<string>>
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.101 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee_id.txt'
.......> OVERWRITE INTO TABLE employee_id;
No rows affected (0.112 seconds)

-- Create the bucket table
jdbc:hive2://> CREATE TABLE employee_id_buckets
.......> (
.......> name string,
.......> employee_id int,
.......> work_place ARRAY<string>,
.......> sex_age STRUCT<sex:string,age:int>,
.......> skills_score MAP<string,int>,
.......> depart_title MAP<string,ARRAY<string>>
.......> )
.......> CLUSTERED BY (employee_id) INTO 2 BUCKETS
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> COLLECTION ITEMS TERMINATED BY ','
.......> MAP KEYS TERMINATED BY ':';
No rows affected (0.104 seconds)
Note Bucket numbers

To define the proper number of buckets, we should avoid having too much or too little data in each bucket. A better choice is somewhere near two blocks of data per bucket. For example, we can plan 512 MB of data in each bucket if the Hadoop block size is 256 MB. If possible, use 2^N as the number of buckets.
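The sizing rule of thumb above can be turned into a small worked calculation. The following Python sketch (the helper name and sizes are illustrative assumptions, not Hive code) targets roughly two HDFS blocks per bucket and rounds up to a power of two:

```python
# Worked example of the bucket-sizing guidance: aim for about two
# HDFS blocks of data per bucket, then round up to a power of two.
# The function name and sizes are illustrative assumptions.

def suggest_buckets(total_data_bytes, block_size_bytes):
    target_bucket_size = 2 * block_size_bytes        # ~two blocks per bucket
    raw = max(1, total_data_bytes // target_bucket_size)
    n = 1
    while n < raw:                                   # round up to 2^N
        n *= 2
    return n

GB = 1024 ** 3
MB = 1024 ** 2

# 10 GB of data with a 256 MB block size -> 512 MB per bucket
# -> 20 buckets raw, rounded up to 32
print(suggest_buckets(10 * GB, 256 * MB))  # 32
```

Note that the bucket count is fixed in the table DDL (CLUSTERED BY ... INTO N BUCKETS), so this estimate should be done against the expected data volume before creating the table.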
Bucketing has a close dependency on the underlying data loaded. To properly load data into a bucket table, we need to either set the maximum number of reducers to the same number of buckets specified in the table creation (for example, 2), or enable enforced bucketing, as follows:

jdbc:hive2://> SET mapred.reduce.tasks = 2;
No rows affected (0.026 seconds)

jdbc:hive2://> SET hive.enforce.bucketing = true;
No rows affected (0.002 seconds)
To populate data into the bucket table, we cannot use the LOAD statement as was done with regular tables, since LOAD does not verify the data against the metadata. Instead, INSERT should be used to populate the bucket table, as follows:

jdbc:hive2://> INSERT OVERWRITE TABLE employee_id_buckets
.......> SELECT * FROM employee_id;
No rows affected (75.468 seconds)

-- Verify the buckets in the HDFS
-bash-4.1$ hdfs dfs -ls /user/hive/warehouse/employee_id_buckets
Found 2 items
-rwxrwxrwx   1 hive hive  900 2014-11-02 10:54 /user/hive/warehouse/employee_id_buckets/000000_0
-rwxrwxrwx   1 hive hive  582 2014-11-02 10:54 /user/hive/warehouse/employee_id_buckets/000001_0
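The two files listed are the two buckets. Conceptually, each row lands in the bucket file numbered hash(bucket column value) mod bucket count; for integer keys Hive's hash is effectively the value itself. The following Python sketch (sample ids are assumptions) models that assignment:

```python
# A sketch of how rows map to bucket files: the bucket column value is
# hashed modulo the bucket count, so rows sharing the same employee_id
# always land in the same bucket file (000000_0, 000001_0, ...).
# For int keys this mimics Hive, where the hash of n is n itself.

NUM_BUCKETS = 2

def bucket_of(employee_id):
    return employee_id % NUM_BUCKETS

rows = [("Michael", 100), ("Will", 101), ("Shelley", 102), ("Lucy", 103)]
buckets = {0: [], 1: []}
for name, emp_id in rows:
    buckets[bucket_of(emp_id)].append(name)

print(buckets[0])  # even ids -> file 000000_0
print(buckets[1])  # odd ids  -> file 000001_0
```

This deterministic placement is what lets Hive sample a single bucket, or join two bucketed tables bucket-by-bucket, without scanning everything.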
Hive views

In Hive, views are logical data structures that can be used to simplify queries by either hiding complexities, such as joins, subqueries, and filters, or by flattening the data. Unlike some RDBMS, Hive views do not store data or get materialized. Once the Hive view is created, its schema is frozen immediately. Subsequent changes to the underlying tables (for example, adding a column) will not be reflected in the view's schema. If an underlying table is dropped or changed, subsequent attempts to query the invalid view will fail, as follows:
jdbc:hive2://> CREATE VIEW employee_skills
.......> AS
.......> SELECT name, skills_score['DB'] AS DB,
.......> skills_score['Perl'] AS Perl,
.......> skills_score['Python'] AS Python,
.......> skills_score['Sales'] AS Sales,
.......> skills_score['HR'] AS HR
.......> FROM employee;
No rows affected (0.253 seconds)
When creating views, there is no MapReduce job triggered at all, since this is only a metadata change. However, a proper MapReduce job will be triggered when querying the view. Use SHOW CREATE TABLE or DESC FORMATTED on the view to display the CREATE VIEW statement that created it. The following are other Hive view DDLs:
Alter the view's properties:

jdbc:hive2://> ALTER VIEW employee_skills
.......> SET TBLPROPERTIES ('comment' = 'This is a view');
No rows affected (0.19 seconds)

Redefine the view:

jdbc:hive2://> ALTER VIEW employee_skills AS
.......> SELECT * FROM employee;
No rows affected (0.17 seconds)

Drop the view:

jdbc:hive2://> DROP VIEW employee_skills;
No rows affected (0.156 seconds)
Summary

After going through this chapter, we are able to define and use various data types in Hive. We should know how to create, alter, and drop tables, partitions, and views in Hive, and how to use external tables, internal tables, partitions, buckets, and views in Hive.

In the next chapter, we will dive into the details of querying data with Hive.
Chapter 4. Data Selection and Scope

This chapter is about how to discover the data by querying it, linking it, and limiting the data ranges or scopes. The chapter mainly covers the syntax and usage of Hive SELECT, WHERE, LIMIT, JOIN, and UNION ALL to operate on datasets.

In this chapter, we will cover the following topics:

The SELECT statement
The common JOIN statement
The special JOIN (MAPJOIN) statement
The set operation statement (UNION ALL)
The SELECT statement

The most common use case of Hive is to query data in Hadoop. To achieve this, we need to write and execute the SELECT statement in Hive. The typical work done by the SELECT statement is to project the rows meeting the query conditions specified in the WHERE clause after the target table, and return the result set. The SELECT statement is quite often used with the FROM, DISTINCT, WHERE, and LIMIT keywords. We will introduce them through examples as follows.

The SELECT * statement here means all the columns in the table are selected. By default, all rows are returned, including duplicated rows. If the DISTINCT keyword is used, only unique rows from the table are selected and returned. The LIMIT keyword is used to limit the number of rows returned. In addition, SELECT * scans the whole table/file without triggering MapReduce jobs, so it runs faster than SELECT <column_name>. Since Hive 0.10.0, simple SELECT statements, such as SELECT <column_name> FROM <table_name> LIMIT n, can also avoid triggering a MapReduce job if the Hive fetch task conversion is enabled by setting hive.fetch.task.conversion=more.
The following tasks can be done:

Query all or specific columns in the table:

jdbc:hive2://> SELECT * FROM employee;
+----------+----------------------+--------------+--------------------+---------------------------------+
| name     | work_place           | sex_age      | skills_score       | depart_title                    |
+----------+----------------------+--------------+--------------------+---------------------------------+
| Michael  | [Montreal, Toronto]  | [Male, 30]   | {DB=80}            | {Product=[Developer, Lead]}     |
| Will     | [Montreal]           | [Male, 35]   | {Perl=85}          | {Test=[Lead], Product=[Lead]}   |
| Shelley  | [New York]           | [Female, 27] | {Python=80}        | {Test=[Lead], COE=[Architect]}  |
| Lucy     | [Vancouver]          | [Female, 57] | {Sales=89, HR=94}  | {Sales=[Lead]}                  |
+----------+----------------------+--------------+--------------------+---------------------------------+
4 rows selected (0.677 seconds)

jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (162.452 seconds)
Select unique values of the specified column:

jdbc:hive2://> SELECT DISTINCT name FROM employee LIMIT 2;
+----------+
| name     |
+----------+
| Lucy     |
| Michael  |
+----------+
2 rows selected (71.125 seconds)

Enable the fetch task and verify the performance improvement:

jdbc:hive2://> SET hive.fetch.task.conversion=more;
No rows affected (0.002 seconds)

jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (0.242 seconds)

Besides LIMIT, WHERE is another generic condition clause to limit the returned result set. The WHERE condition can be any Boolean expression or user-defined function comparing table or partition columns:

jdbc:hive2://> SELECT name, work_place FROM employee
.......> WHERE name = 'Michael';
+----------+-------------------------+
| name     | work_place              |
+----------+-------------------------+
| Michael  | ["Montreal","Toronto"]  |
+----------+-------------------------+
1 row selected (38.107 seconds)
Multiple SELECT statements can work together to build a complex query using nested queries or subqueries, such as JOIN and UNION. The following are a few examples of using nested/subqueries. Subqueries can be used in the form of WITH (also referred to as CTE since Hive 0.13.0), or after the FROM or WHERE statements. When using subqueries, an alias should be given for the subquery (see t1 in the following example); otherwise, Hive will report exceptions. The different uses of SELECT statements are as follows:
A nested SELECT using CTE can be implemented as follows:

jdbc:hive2://> WITH t1 AS (
.......> SELECT * FROM employee
.......> WHERE sex_age.sex = 'Male')
.......> SELECT name, sex_age.sex AS sex FROM t1;
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (38.706 seconds)

A nested SELECT after the FROM statement can be implemented as follows:

jdbc:hive2://> SELECT name, sex_age.sex AS sex
.......> FROM
.......> (
.......> SELECT * FROM employee
.......> WHERE sex_age.sex = 'Male'
.......> ) t1;
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (48.198 seconds)
The Hive subquery in the WHERE clause can be used with IN, NOT IN, EXISTS, or NOT EXISTS, as follows. If the alias (see the following example for the employee table) is not specified before columns (name) in the WHERE condition, Hive will report the error "Correlating expression cannot contain unqualified column references". This is a limitation of the Hive subquery. A subquery that uses EXISTS or NOT EXISTS must refer to both inner and outer expressions. This is similar to the JOIN table, which is introduced later. This is not supported by the IN and NOT IN clauses.
jdbc:hive2://> SELECT name, sex_age.sex AS sex
.......> FROM employee a
.......> WHERE a.name IN
.......> (SELECT name FROM employee
.......> WHERE sex_age.sex = 'Male'
.......> );
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (54.644 seconds)

jdbc:hive2://> SELECT name, sex_age.sex AS sex
.......> FROM employee a
.......> WHERE EXISTS
.......> (SELECT * FROM employee b
.......> WHERE a.sex_age.sex = b.sex_age.sex
.......> AND b.sex_age.sex = 'Male'
.......> );
+----------+-------+
| name     | sex   |
+----------+-------+
| Michael  | Male  |
| Will     | Male  |
+----------+-------+
2 rows selected (69.48 seconds)

There are additional restrictions for subqueries used in WHERE clauses:

Subqueries can only appear on the right-hand side of the WHERE clauses
Nested subqueries are not allowed
The IN and NOT IN statements support only one column
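The difference between the two patterns is that IN tests membership against an uncorrelated list, while EXISTS re-evaluates a correlated condition per outer row. The following plain-Python rendering of the two queries above (using the chapter's sample rows) makes that explicit:

```python
# A plain-Python rendering of the IN and EXISTS subquery patterns above,
# using the chapter's sample employee rows.

employees = [
    {"name": "Michael", "sex": "Male"},
    {"name": "Will", "sex": "Male"},
    {"name": "Shelley", "sex": "Female"},
    {"name": "Lucy", "sex": "Female"},
]

# WHERE a.name IN (SELECT name FROM employee WHERE sex_age.sex = 'Male')
# -- the inner query runs once, producing an uncorrelated value list
male_names = {e["name"] for e in employees if e["sex"] == "Male"}
in_result = [e["name"] for e in employees if e["name"] in male_names]

# WHERE EXISTS (SELECT * FROM employee b
#               WHERE a.sex_age.sex = b.sex_age.sex AND b.sex_age.sex = 'Male')
# -- the inner condition is correlated: it references the outer row a
exists_result = [a["name"] for a in employees
                 if any(a["sex"] == b["sex"] and b["sex"] == "Male"
                        for b in employees)]

print(in_result)      # ['Michael', 'Will']
print(exists_result)  # ['Michael', 'Will']
```

Both return the same rows here, but only the EXISTS form is allowed to reference outer columns, which matches the Hive restriction described above.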
The INNER JOIN statement

Hive JOIN is used to combine rows from two or more tables together. Hive supports common JOIN operations such as those in an RDBMS, for example, JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, and CROSS JOIN. However, Hive only supports equal JOIN instead of unequal JOIN, because unequal JOIN is difficult to convert to MapReduce jobs.

The INNER JOIN in Hive uses the JOIN keyword, which returns rows meeting the JOIN conditions from both left and right tables. The INNER JOIN keyword can also be omitted by using comma-separated table names, since Hive 0.13.0. See the following examples showing various inner JOIN statements in Hive:
Prepare another table to join and load data:

jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_hr
.......> (
.......> name string,
.......> employee_id int,
.......> sin_number string,
.......> start_date date
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> STORED AS TEXTFILE;
No rows affected (1.732 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/Dayongd/employee_hr.txt'
.......> OVERWRITE INTO TABLE employee_hr;
No rows affected (0.635 seconds)

Perform an inner JOIN between two tables with equal JOIN conditions:

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+
3 rows selected (71.083 seconds)
The JOIN operation can be performed among more tables (three tables in this case), as follows:

jdbc:hive2://> SELECT emp.name, empi.employee_id, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name = emph.name
.......> JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name  | empi.employee_id  | emph.sin_number  |
+-----------+-------------------+------------------+
| Michael   | 100               | 547-968-091      |
| Will      | 101               | 527-948-090      |
| Lucy      | 103               | 577-928-094      |
+-----------+-------------------+------------------+
3 rows selected (67.933 seconds)

Self-join is a special JOIN where one table joins itself. When doing such joins, a different alias should be given to distinguish the same table:

jdbc:hive2://> SELECT emp.name
.......> FROM employee emp
.......> JOIN employee emp_b
.......> ON emp.name = emp_b.name;
+-----------+
| emp.name  |
+-----------+
| Michael   |
| Will      |
| Shelley   |
| Lucy      |
+-----------+
4 rows selected (59.891 seconds)
Implicit join is a JOIN operation without using the JOIN keyword. It is supported since Hive 0.13.0:

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp, employee_hr emph
.......> WHERE emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+
3 rows selected (47.241 seconds)
A JOIN operation that uses different columns in its join conditions will create an additional MapReduce job:

jdbc:hive2://> SELECT emp.name, empi.employee_id, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name = emph.name
.......> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name  | empi.employee_id  | emph.sin_number  |
+-----------+-------------------+------------------+
| Michael   | 100               | 547-968-091      |
| Will      | 101               | 527-948-090      |
| Lucy      | 103               | 577-928-094      |
+-----------+-------------------+------------------+
3 rows selected (49.785 seconds)
Note: If JOIN uses different columns in the join conditions, it will request additional job stages to complete the join. If the JOIN operation uses the same column in the join conditions, Hive will join on this condition using one stage.

When JOIN is performed between multiple tables, MapReduce jobs are created to process the data in HDFS. Each of the jobs is called a stage. Usually, it is suggested for JOIN statements to put the big table right at the end for better performance, as well as to avoid Out Of Memory (OOM) exceptions, because the last table in the sequence is streamed through the reducers whereas the others are buffered in the reducer by default. Also, a hint, such as /*+STREAMTABLE(table_name)*/, can be specified to tell Hive which table is streamed, as follows:
jdbc:hive2://> SELECT /*+STREAMTABLE(employee_hr)*/
.......> emp.name, empi.employee_id, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name = emph.name
.......> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
The OUTER JOIN and CROSS JOIN statements

Besides INNER JOIN, Hive also supports regular OUTER JOIN and FULL JOIN. The logic of such a JOIN is the same as that in an RDBMS. The following summarizes the differences of the common JOIN types (assume table_m has m rows and table_n has n rows):

table_m JOIN table_n: This returns all rows matched in both tables. Rows returned: m ∩ n.

table_m LEFT [OUTER] JOIN table_n: This returns all rows in the left table and matched rows in the right table. If there is no match in the right table, it returns NULL for the right table's columns. Rows returned: m.

table_m RIGHT [OUTER] JOIN table_n: This returns all rows in the right table and matched rows in the left table. If there is no match in the left table, it returns NULL for the left table's columns. Rows returned: n.

table_m FULL [OUTER] JOIN table_n: This returns all rows in both tables and matched rows in both tables. If there is no match in the left or right table, it returns NULL instead. Rows returned: m + n - m ∩ n.

table_m CROSS JOIN table_n: This returns all row combinations in both tables to produce a Cartesian product. Rows returned: m * n.
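The row-count column above can be checked with a small simulation. The following Python sketch (not Hive code) joins the sample employee (left) and employee_hr (right) name lists and counts the rows each JOIN type would produce:

```python
# A small simulation of the row counts in the JOIN summary above,
# joining on the name key of the sample employee (left, m = 4 rows)
# and employee_hr (right, n = 4 rows) tables.

left = ["Michael", "Will", "Shelley", "Lucy"]     # m = 4
right = ["Michael", "Will", "Steven", "Lucy"]     # n = 4

inner = [(l, r) for l in left for r in right if l == r]
left_join = [(l, l if l in right else None) for l in left]
right_join = [(r if r in left else None, r) for r in right]
full = left_join + [(None, r) for r in right if r not in left]
cross = [(l, r) for l in left for r in right]

print(len(inner))       # 3  rows: m ∩ n
print(len(left_join))   # 4  rows: m
print(len(right_join))  # 4  rows: n
print(len(full))        # 5  rows: m + n - m ∩ n
print(len(cross))       # 16 rows: m * n
```

The unmatched rows (Shelley on the left, Steven on the right) are exactly the ones padded with NULL in the OUTER JOIN outputs shown next.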
The following examples demonstrate OUTER JOIN:

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> LEFT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Shelley   | NULL             |
| Lucy      | 577-928-094      |
+-----------+------------------+
4 rows selected (39.637 seconds)

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> RIGHT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| NULL      | 647-968-598      |
| Lucy      | 577-928-094      |
+-----------+------------------+
4 rows selected (34.485 seconds)

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> FULL JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Lucy      | 577-928-094      |
| Michael   | 547-968-091      |
| Shelley   | NULL             |
| NULL      | 647-968-598      |
| Will      | 527-948-090      |
+-----------+------------------+
5 rows selected (64.251 seconds)
The CROSS JOIN statement, which is available since Hive 0.10.0, does not have a JOIN condition. The CROSS JOIN statement can also be written using JOIN without a condition, or with an always-true condition, such as 1 = 1. The following three ways of writing CROSS JOIN produce the same result set:
jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> CROSS JOIN employee_hr emph;

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph;

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON 1 = 1;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Michael   | 527-948-090      |
| Michael   | 647-968-598      |
| Michael   | 577-928-094      |
| Will      | 547-968-091      |
| Will      | 527-948-090      |
| Will      | 647-968-598      |
| Will      | 577-928-094      |
| Shelley   | 547-968-091      |
| Shelley   | 527-948-090      |
| Shelley   | 647-968-598      |
| Shelley   | 577-928-094      |
| Lucy      | 547-968-091      |
| Lucy      | 527-948-090      |
| Lucy      | 647-968-598      |
| Lucy      | 577-928-094      |
+-----------+------------------+
16 rows selected (34.924 seconds)
In addition, JOIN always happens before WHERE. If possible, push filter conditions into the JOIN conditions rather than into the WHERE conditions, so that the result set is filtered immediately after the JOIN. What's more, JOIN is NOT commutative! It is always left-associative, no matter whether it is a LEFT JOIN or a RIGHT JOIN.

Although Hive does not support unequal JOIN explicitly, there are workarounds using CROSS JOIN and WHERE conditions, as mentioned in the following example:
jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> JOIN employee_hr emph ON emp.name <> emph.name;
Error: Error while compiling statement: FAILED: SemanticException [Error 10017]: Line 1:77 Both left and right aliases encountered in JOIN 'name' (state=42000,code=10017)

jdbc:hive2://> SELECT emp.name, emph.sin_number
.......> FROM employee emp
.......> CROSS JOIN employee_hr emph WHERE emp.name <> emph.name;
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 527-948-090      |
| Michael   | 647-968-598      |
| Michael   | 577-928-094      |
| Will      | 547-968-091      |
| Will      | 647-968-598      |
| Will      | 577-928-094      |
| Shelley   | 547-968-091      |
| Shelley   | 527-948-090      |
| Shelley   | 647-968-598      |
| Shelley   | 577-928-094      |
| Lucy      | 547-968-091      |
| Lucy      | 527-948-090      |
| Lucy      | 647-968-598      |
+-----------+------------------+
13 rows selected (35.016 seconds)
Special JOIN – MAPJOIN

The MAPJOIN statement means performing the JOIN operation only with maps, without a reduce job. The MAPJOIN statement reads all the data from the small table into memory and broadcasts it to all maps. During the map phase, the JOIN operation is performed by comparing each row of data in the big table with the small table against the join conditions. Because there is no reduce needed, JOIN performance is improved. When the hive.auto.convert.join setting is set to true, Hive automatically converts the JOIN to MAPJOIN at runtime if possible, instead of checking the map join hint. In addition, MAPJOIN can be used for unequal joins to improve performance, since both MAPJOIN and WHERE are performed in the map phase. The following is an example of MAPJOIN that is enabled by a query hint:
jdbc:hive2://> SELECT /*+MAPJOIN(employee)*/ emp.name, emph.sin_number
.......> FROM employee emp
.......> CROSS JOIN employee_hr emph WHERE emp.name <> emph.name;

The MAPJOIN operation does not support the following:

The use of MAPJOIN after UNION ALL, LATERAL VIEW, GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY
The use of MAPJOIN before UNION, JOIN, and another MAPJOIN
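Conceptually, a map-side join is a build-and-probe hash join with no shuffle phase. The following Python sketch (an illustration, not Hive internals) loads the small table into a hash table once and streams the big table against it, which is what each mapper does after receiving the broadcast copy:

```python
# A conceptual sketch of MAPJOIN: the small table is loaded into an
# in-memory hash table (the broadcast copy), and the big table is
# streamed through, probing the hash table row by row.
# No shuffle/reduce phase is needed.

small_table = [("Michael", "547-968-091"), ("Will", "527-948-090"),
               ("Steven", "647-968-598"), ("Lucy", "577-928-094")]
big_table = ["Michael", "Will", "Shelley", "Lucy"]

# Build phase: hash the small table once.
sin_by_name = dict(small_table)

# Probe phase: each "mapper" streams big-table rows and joins in place.
joined = [(name, sin_by_name[name])
          for name in big_table if name in sin_by_name]

print(joined)
# [('Michael', '547-968-091'), ('Will', '527-948-090'), ('Lucy', '577-928-094')]
```

This also explains the memory constraint: the small table must fit in each mapper's memory, which is why Hive only auto-converts joins whose small side is below a size threshold.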
The bucket map join is a special type of MAPJOIN that uses bucket columns (the column specified by CLUSTERED BY in the CREATE TABLE statement) as the join condition. Instead of fetching the whole table, as done by the regular map join, the bucket map join only fetches the required bucket data. To enable the bucket map join, we need to set hive.optimize.bucketmapjoin = true and make sure the bucket numbers are multiples of each other. If both tables joined are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all the small tables in memory. The following additional settings are needed to enable this behavior:

SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
The LEFT SEMI JOIN statement is also a type of MAPJOIN. Before Hive supported IN/EXISTS, LEFT SEMI JOIN was used to implement such a request, as shown in the following example. The restriction of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition, but not in the WHERE or SELECT clauses.
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> WHERE EXISTS
.......> (SELECT * FROM employee_id b
.......> WHERE a.name = b.name);
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> LEFT SEMI JOIN employee_id b
.......> ON a.name = b.name;
+----------+
| a.name   |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (35.027 seconds)
Set operation – UNION ALL

To operate on result sets vertically, Hive only supports UNION ALL right now, and the result set of UNION ALL keeps duplicates, if any. Before Hive 0.13.0, UNION ALL could only be used in a subquery. Since Hive 0.13.0, UNION ALL can also be used in top-level queries. The following are examples of the UNION ALL statement:
Check the name column in the employee_hr and employee tables:
jdbc:hive2://> SELECT name FROM employee_hr;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Steven   |
| Lucy     |
+----------+
4 rows selected (0.116 seconds)
jdbc:hive2://> SELECT name FROM employee;
+----------+
| name     |
+----------+
| Michael  |
| Will     |
| Shelley  |
| Lucy     |
+----------+
4 rows selected (0.049 seconds)
Use UNION ALL on the name column from both tables, keeping duplicates:
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> UNION ALL
.......> SELECT b.name
.......> FROM employee_hr b;
+-----------+
| _u1.name  |
+-----------+
| Michael   |
| Will      |
| Shelley   |
| Lucy      |
| Michael   |
| Will      |
| Steven    |
| Lucy      |
+-----------+
8 rows selected (39.93 seconds)
For other set operations supported by RDBMS, such as UNION, INTERSECT, and MINUS, we can use SELECT with WHERE conditions to implement them as follows:
Implement UNION between two tables without duplicates:
jdbc:hive2://> SELECT DISTINCT name
.......> FROM
.......> (
.......> SELECT a.name AS name
.......> FROM employee a
.......> UNION ALL
.......> SELECT b.name AS name
.......> FROM employee_hr b
.......> ) union_set;
+----------+
| name     |
+----------+
| Lucy     |
| Michael  |
| Shelley  |
| Steven   |
| Will     |
+----------+
5 rows selected (100.366 seconds)
Note
The subquery alias (such as union_set in this example) must be given to avoid a Hive syntax error.
Implement INTERSECT between the employee and employee_hr tables using JOIN:
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> JOIN employee_hr b
.......> ON a.name = b.name;
+----------+
| a.name   |
+----------+
| Michael  |
| Will     |
| Lucy     |
+----------+
3 rows selected (44.862 seconds)
Implement MINUS between the employee and employee_hr tables using OUTER JOIN:
jdbc:hive2://> SELECT a.name
.......> FROM employee a
.......> LEFT JOIN employee_hr b
.......> ON a.name = b.name
.......> WHERE b.name IS NULL;
+----------+
| a.name   |
+----------+
| Shelley  |
+----------+
1 row selected (36.841 seconds)
Summary

In this chapter, you learned to use SELECT statements to discover the data you need. Then, we introduced the Hive operations that link different datasets in the vertical or horizontal direction using JOIN or UNION ALL. After going through this chapter, we should be able to use the SELECT statement with different WHERE conditions, LIMIT, DISTINCT, and complex subqueries. We should be able to understand and use the different types of JOIN statements to link datasets horizontally, and UNION ALL to combine datasets vertically.
In the next chapter, we will talk about the details of exchanging, ordering, and transforming data, as well as transactions in Hive.
Chapter 5. Data Manipulation

The ability to manipulate data is a critical capability in big data analysis. Manipulating data is the process of exchanging, moving, sorting, and transforming data. This technique is used in many situations, such as cleaning data, searching for patterns, identifying trends, and so on. Hive offers various query statements, keywords, operators, and functions to carry out data manipulation.
In this chapter, we will cover the following topics:
- Data exchange using LOAD, INSERT, IMPORT, and EXPORT
- Order and sort
- Operators and functions
- Transactions
Data exchange – LOAD

To move data in Hive, it uses the LOAD keyword. Move here means the original data is moved to the target table/partition and no longer exists in the original place. The following is an example of how to move data to a Hive table or partition from local or HDFS files. The LOCAL keyword specifies where the files are located on the host. If the LOCAL keyword is not specified, the files are loaded from the full Uniform Resource Identifier (URI) specified after INPATH, or from the value of the fs.default.name Hive property by default. The path after INPATH can be a relative path or an absolute path. The path either points to a file or a folder (all files in the folder) to be loaded, but subfolders are not allowed in the specified path. If the data is loaded into a partitioned table, the partition columns must be specified. The OVERWRITE keyword is used to decide whether to append to or replace the existing data in the target table/partition.
The following are examples of loading files into Hive tables:
Load local data to the Hive table:
jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee_hr.txt'
.......> OVERWRITE INTO TABLE employee_hr;
No rows affected (0.436 seconds)
Load local data to the Hive partition table:
jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee.txt'
.......> OVERWRITE INTO TABLE employee_partitioned
.......> PARTITION (year=2014, month=12);
No rows affected (0.772 seconds)
Load HDFS data to the Hive table using the default system path:
jdbc:hive2://> LOAD DATA INPATH
.......> '/user/dayongd/employee/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (0.453 seconds)
Load HDFS data to the Hive table with the full URI:
jdbc:hive2://> LOAD DATA INPATH
.......> 'hdfs://[dfs_host]:8020/user/dayongd/employee/employee.txt'
.......> OVERWRITE INTO TABLE employee;
No rows affected (0.297 seconds)
Data exchange – INSERT

To extract data from Hive tables/partitions, we can use the INSERT keyword. Like RDBMS, Hive supports inserting data by selecting data from other tables. This is a very common way to populate a table from existing data. The basic INSERT statement has the same syntax as a relational database's INSERT. However, Hive has improved its INSERT statement by supporting OVERWRITE, multiple INSERT, dynamic partition INSERT, as well as INSERT to files. The following are a few examples:
The following is a regular INSERT from a SELECT statement:
-- Check the target table, which is empty.
jdbc:hive2://> SELECT name, work_place, sex_age
.......> FROM employee;
+---------------+---------------------+------------------+
| employee.name | employee.work_place | employee.sex_age |
+---------------+---------------------+------------------+
+---------------+---------------------+------------------+
No rows selected (0.115 seconds)
-- Populate data from SELECT
jdbc:hive2://> INSERT INTO TABLE employee
.......> SELECT * FROM ctas_employee;
No rows affected (31.701 seconds)
-- Verify the data loaded
jdbc:hive2://> SELECT name, work_place, sex_age FROM employee;
+---------------+------------------------+---------------------------+
| employee.name | employee.work_place    | employee.sex_age          |
+---------------+------------------------+---------------------------+
| Michael       | ["Montreal","Toronto"] | {"sex":"Male","age":30}   |
| Will          | ["Montreal"]           | {"sex":"Male","age":35}   |
| Shelley       | ["New York"]           | {"sex":"Female","age":27} |
| Lucy          | ["Vancouver"]          | {"sex":"Female","age":57} |
+---------------+------------------------+---------------------------+
4 rows selected (0.12 seconds)
Insert data from the CTE statement:
jdbc:hive2://> WITH a AS (SELECT * FROM ctas_employee)
.......> FROM a
.......> INSERT OVERWRITE TABLE employee
.......> SELECT *;
No rows affected (30.1 seconds)
Run multiple INSERT statements by scanning the source table only once:
jdbc:hive2://> FROM ctas_employee
.......> INSERT OVERWRITE TABLE employee
.......> SELECT *
.......> INSERT OVERWRITE TABLE employee_internal
.......> SELECT *;
No rows affected (27.919 seconds)
Note
The INSERT OVERWRITE statement will replace the data in the target table/partition, while INSERT INTO will append data.
When inserting data into partitions, we need to specify the partition columns. Instead of specifying static values for static partitions, Hive also supports giving partition values dynamically. Dynamic partitions are useful when the data volume is large and we don't know what the partition values will be, for example, when the date is dynamically used as a partition column.
Dynamic partition is not enabled by default. We need to set the following properties to make it work:
jdbc:hive2://> SET hive.exec.dynamic.partition=true;
No rows affected (0.002 seconds)
By default, the user must specify at least one static partition column. This is to avoid accidentally overwriting partitions. To disable this restriction, we can set the partition mode to nonstrict from the default strict mode before inserting into dynamic partitions, as follows:
jdbc:hive2://> SET hive.exec.dynamic.partition.mode=nonstrict;
No rows affected (0.002 seconds)
jdbc:hive2://> INSERT INTO TABLE employee_partitioned
.......> PARTITION(year, month)
.......> SELECT name, array('Toronto') as work_place,
.......> named_struct("sex","Male","age",30) as sex_age,
.......> map("Python",90) as skills_score,
.......> map("R&D",array('Developer')) as depart_title,
.......> year(start_date) as year, month(start_date) as month
.......> FROM employee_hr eh
.......> WHERE eh.employee_id = 102;
No rows affected (29.024 seconds)
Note
Complex type constructors are used in the preceding example to assign a constant value to a complex data type column.
The Hive INSERT to files statement is the opposite operation of LOAD. It extracts data from SELECT statements to local or HDFS files. However, it only supports the OVERWRITE keyword, not INTO. This means we cannot append the extracted data to existing files. By default, the columns are separated by ^A and the rows are separated by newlines. Since Hive 0.11.0, row separators can be specified. The following are a few examples of inserting data to files:
We can insert to local files with default row separators. In some recent versions of Hadoop, the local directory path only works for a directory level less than two. We may need to set hive.insert.into.multilevel.dirs=true to get this fixed:
jdbc:hive2://> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output1'
.......> SELECT * FROM employee;
No rows affected (30.859 seconds)
Note
By default, many partial files could be created by the reducers when doing INSERT. To merge them into one, we can use HDFS commands, as shown in the following example:
hdfs dfs -getmerge hdfs://<host_name>:8020/user/dayongd/output /tmp/test
Insert to local files with specified row separators:
jdbc:hive2://> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output2'
.......> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
.......> SELECT * FROM employee;
No rows affected (31.937 seconds)
-- Verify the separator
vi /tmp/output2/000000_0
Michael,Montreal^BToronto,Male^B30,DB^C80,Product^CDeveloper^DLead
Will,Montreal,Male^B35,Perl^C85,Product^CLead^BTest^CLead
Shelley,New York,Female^B27,Python^C80,Test^CLead^BCOE^CArchitect
Lucy,Vancouver,Female^B57,Sales^C89^BHR^C94,Sales^CLead
Fire multiple INSERT statements from the same table SELECT statement:
jdbc:hive2://> FROM employee
.......> INSERT OVERWRITE DIRECTORY '/user/dayongd/output'
.......> SELECT *
.......> INSERT OVERWRITE DIRECTORY '/user/dayongd/output1'
.......> SELECT *;
No rows affected (25.4 seconds)
Note
Besides the Hive INSERT statement, Hive and HDFS shell commands can also be used to extract data to local or remote files, with both append and overwrite supported. The hive -e 'quoted_hql_string' or hive -f <hql_filename> commands can execute a Hive query statement or query file. Linux redirection operators and piping can be used with these commands to redirect result sets. The following are a few examples:
Append to local files:
$ hive -e 'select * from employee' >> test
Overwrite to local files:
$ hive -e 'select * from employee' > test
Append to HDFS files:
$ hive -e 'select * from employee' | hdfs dfs -appendToFile - /user/dayongd/output2/test
Overwrite to HDFS files:
$ hive -e 'select * from employee' | hdfs dfs -put -f - /user/dayongd/output2/test
Data exchange – EXPORT and IMPORT

When working with Hive, we sometimes need to migrate data among different environments, or we may need to back up some data. Since Hive 0.8.0, EXPORT and IMPORT statements are available to support importing and exporting data in HDFS for data migration or backup/restore purposes.
The EXPORT statement will export both data and metadata from a table or partition. Metadata is exported in a file called _metadata. Data is exported in a subdirectory called data:
jdbc:hive2://> EXPORT TABLE employee TO '/user/dayongd/output3';
No rows affected (0.19 seconds)
After EXPORT, we can manually copy the exported files to other Hive instances or use the Hadoop distcp command to copy them to other HDFS clusters. Then, we can import the data in the following ways:
Import data to a table with the same name. It throws an error if the table exists:
jdbc:hive2://> IMPORT FROM '/user/dayongd/output3';
Error: Error while compiling statement: FAILED: SemanticException
[Error 10119]: Table exists and contains data files
(state=42000,code=10119)
Import data to a new table:
jdbc:hive2://> IMPORT TABLE employee_imported FROM
.......> '/user/dayongd/output3';
No rows affected (0.788 seconds)
Import data to an external table, where the LOCATION property is optional:
jdbc:hive2://> IMPORT EXTERNAL TABLE employee_imported_external
.......> FROM '/user/dayongd/output3'
.......> LOCATION '/user/dayongd/output4';
No rows affected (0.256 seconds)
Export and import partitions:
jdbc:hive2://> EXPORT TABLE employee_partitioned PARTITION
.......> (year=2014, month=11) TO '/user/dayongd/output5';
No rows affected (0.247 seconds)
jdbc:hive2://> IMPORT TABLE employee_partitioned_imported
.......> FROM '/user/dayongd/output5';
No rows affected (0.14 seconds)
ORDER and SORT

Another aspect of manipulating data in Hive is properly ordering or sorting the data or result sets to clearly identify the important facts, such as the top N values, maximum, minimum, and so on.
The following keywords are used in Hive to order and sort data:
ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. A sorted order is maintained across all of the output from every reducer. It performs the global sort using only one reducer, so it takes longer to return the result. Usage with LIMIT is strongly recommended for ORDER BY. When hive.mapred.mode=strict is set (by default, hive.mapred.mode=nonstrict) and we do not specify LIMIT, an exception is thrown. This can be used as follows:
jdbc:hive2://> SELECT name FROM employee ORDER BY NAME DESC;
+----------+
| name     |
+----------+
| Will     |
| Shelley  |
| Michael  |
| Lucy     |
+----------+
4 rows selected (57.057 seconds)
SORT BY (ASC|DESC): This indicates which columns to sort by when ordering the reducer input records. This means it completes sorting before sending data to the reducers. The SORT BY statement does not perform a global sort and only makes sure data is locally sorted in each reducer, unless we set mapred.reduce.tasks=1, in which case it is equal to the result of ORDER BY. It can be used as follows:
-- Use more than 1 reducer
jdbc:hive2://> SET mapred.reduce.tasks=2;
No rows affected (0.001 seconds)
jdbc:hive2://> SELECT name FROM employee SORT BY NAME DESC;
+----------+
| name     |
+----------+
| Shelley  |
| Michael  |
| Lucy     |
| Will     |
+----------+
4 rows selected (54.386 seconds)
-- Use only 1 reducer
jdbc:hive2://> SET mapred.reduce.tasks=1;
No rows affected (0.002 seconds)
jdbc:hive2://> SELECT name FROM employee SORT BY NAME DESC;
+----------+
| name     |
+----------+
| Will     |
| Shelley  |
| Michael  |
| Lucy     |
+----------+
4 rows selected (46.03 seconds)
DISTRIBUTE BY: Rows with matching column values will be partitioned to the same reducer. When used alone, it does not guarantee sorted input to the reducers. The DISTRIBUTE BY statement is similar to GROUP BY in RDBMS in terms of deciding which reducer to distribute the mapper output to. When used with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement, and the columns used to distribute must appear in the select column list. It can be used as follows:
jdbc:hive2://> SELECT name
.......> FROM employee_hr DISTRIBUTE BY employee_id;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10004]: Line 1:44 Invalid table alias or column reference
'employee_id': (possible column names are: name)
(state=42000,code=10004)
jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr DISTRIBUTE BY employee_id;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Steven   | 102          |
| Will     | 101          |
| Michael  | 100          |
+----------+--------------+
4 rows selected (38.92 seconds)
-- Used with SORT BY
jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr
.......> DISTRIBUTE BY employee_id SORT BY name;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Michael  | 100          |
| Steven   | 102          |
| Will     | 101          |
+----------+--------------+
4 rows selected (38.01 seconds)
CLUSTER BY: This is a shorthand operator to perform DISTRIBUTE BY and SORT BY operations on the same group of columns, and the data is sorted locally in each reducer. The CLUSTER BY statement does not support ASC or DESC yet. Compared to ORDER BY, which is globally sorted, the CLUSTER BY operation is sorted within each distributed group. To fully utilize all the available reducers when doing a global sort, we can do CLUSTER BY first and then ORDER BY. This can be used as follows:
jdbc:hive2://> SELECT name, employee_id
.......> FROM employee_hr CLUSTER BY name;
+----------+--------------+
| name     | employee_id  |
+----------+--------------+
| Lucy     | 103          |
| Michael  | 100          |
| Steven   | 102          |
| Will     | 101          |
+----------+--------------+
4 rows selected (39.791 seconds)
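The CLUSTER BY-then-ORDER BY tip mentioned above can be sketched as a nested query. This is an illustrative pattern, not one of the book's examples:

```sql
-- Distribute and locally sort across all available reducers first,
-- then let ORDER BY perform the final, cheaper global merge
SELECT name, employee_id
FROM (
  SELECT name, employee_id
  FROM employee_hr
  CLUSTER BY name
) clustered
ORDER BY name;
```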
The difference between ORDER BY and CLUSTER BY can be seen in the following diagram:
Operators and functions

To further manipulate data, we can also use expressions, operators, and functions in Hive to transform data. The Hive wiki (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) offers specifications for each expression and function, so we do not repeat all of them here, except for a few important usages and tips in this chapter.
Hive has defined relational operators, arithmetic operators, logical operators, complex type constructors, and complex type operators. The relational, arithmetic, and logical operators are similar to the standard operators in SQL/Java, so we do not repeat them in this chapter. The operators on complex data types were already introduced in the Understanding Hive data types section of Chapter 3, Data Definition and Description, as well as in the dynamic partition insert example in this chapter.
The functions in Hive are categorized as follows:
- Mathematical functions: These functions are mainly used to perform mathematical calculations, such as RAND() and E().
- Collection functions: These functions are used to find the size, keys, and values of complex types, such as SIZE(Array<T>).
- Type conversion functions: These are mainly the CAST and BINARY functions to convert one type to another.
- Date functions: These functions are used to perform date-related calculations, such as YEAR(string date) and MONTH(string date).
- Conditional functions: These functions are used to check specific conditions with a defined value returned, such as COALESCE, IF, and CASE WHEN.
- String functions: These functions are used to perform string-related operations, such as UPPER(string A) and TRIM(string A).
- Aggregate functions: These functions are used to perform aggregation (introduced in more detail in the next chapter), such as SUM() and COUNT(*).
- Table-generating functions: These functions transform a single input row into multiple output rows, such as EXPLODE(MAP) and JSON_TUPLE(jsonString, k1, k2, …).
- Customized functions: These functions are created with Java code as extensions for Hive. They are introduced in Chapter 8, Extensibility Considerations.
To list Hive built-in functions/UDFs, we can use the following commands in the Hive CLI:
SHOW FUNCTIONS; -- List all functions
DESCRIBE FUNCTION <function_name>; -- Detail for the specified function
DESCRIBE FUNCTION EXTENDED <function_name>; -- Even more details
The following are a few examples and tips for using these functions:
Complex data type function tips: The SIZE function is used to calculate the size of a MAP, ARRAY, or nested MAP/ARRAY. It returns -1 if the size is unknown. It can be used as follows:
jdbc:hive2://> SELECT work_place, skills_score, depart_title
.......> FROM employee;
+------------------------+----------------------+---------------------------------------+
| work_place             | skills_score         | depart_title                          |
+------------------------+----------------------+---------------------------------------+
| ["Montreal","Toronto"] | {"DB":80}            | {"Product":["Developer","Lead"]}      |
| ["Montreal"]           | {"Perl":85}          | {"Product":["Lead"],"Test":["Lead"]}  |
| ["New York"]           | {"Python":80}        | {"Test":["Lead"],"COE":["Architect"]} |
| ["Vancouver"]          | {"Sales":89,"HR":94} | {"Sales":["Lead"]}                    |
+------------------------+----------------------+---------------------------------------+
4 rows selected (0.084 seconds)
jdbc:hive2://> SELECT SIZE(work_place) AS array_size,
.......> SIZE(skills_score) AS map_size,
.......> SIZE(depart_title) AS complex_size,
.......> SIZE(depart_title["Product"]) AS nest_size
.......> FROM employee;
+-------------+-----------+---------------+------------+
| array_size  | map_size  | complex_size  | nest_size  |
+-------------+-----------+---------------+------------+
| 2           | 1         | 1             | 2          |
| 1           | 1         | 2             | 1          |
| 1           | 1         | 2             | -1         |
| 1           | 2         | 1             | -1         |
+-------------+-----------+---------------+------------+
4 rows selected (0.062 seconds)
The ARRAY_CONTAINS statement checks whether the array contains some value, returning TRUE or FALSE. The SORT_ARRAY statement sorts the array in ascending order. These can be used as follows:
jdbc:hive2://> SELECT ARRAY_CONTAINS(work_place, 'Toronto')
.......> AS is_Toronto,
.......> SORT_ARRAY(work_place) AS sorted_array
.......> FROM employee;
+-------------+-------------------------+
| is_toronto  | sorted_array            |
+-------------+-------------------------+
| true        | ["Montreal","Toronto"]  |
| false       | ["Montreal"]            |
| false       | ["New York"]            |
| false       | ["Vancouver"]           |
+-------------+-------------------------+
4 rows selected (0.059 seconds)
Date function tips: The FROM_UNIXTIME(UNIX_TIMESTAMP()) statement performs the same function as SYSDATE in Oracle. It dynamically returns the current date and time on the Hive server, as follows:
jdbc:hive2://> SELECT
.......> FROM_UNIXTIME(UNIX_TIMESTAMP()) AS current_time
.......> FROM employee LIMIT 1;
+----------------------+
| current_time         |
+----------------------+
| 2014-11-15 19:28:29  |
+----------------------+
1 row selected (0.047 seconds)
The UNIX_TIMESTAMP() statement can be used to compare two dates, or can be used after ORDER BY to properly order different string representations of a date value, such as ORDER BY UNIX_TIMESTAMP(string_date, 'dd-MM-yyyy'). It can be used as follows:
-- To compare the difference between two dates.
jdbc:hive2://> SELECT (UNIX_TIMESTAMP('2015-01-21 18:00:00')
.......> - UNIX_TIMESTAMP('2015-01-10 11:00:00'))/60/60/24
.......> AS daydiff FROM employee LIMIT 1;
+---------------------+
| daydiff             |
+---------------------+
| 11.291666666666666  |
+---------------------+
1 row selected (0.093 seconds)
The TO_DATE statement removes the hours, minutes, and seconds from a date. This is useful when we need to check whether the value of a date-time column is within a date range, such as WHERE TO_DATE(update_datetime) BETWEEN '2014-11-01' AND '2014-11-31'. It can be used as follows:
jdbc:hive2://> SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()))
.......> AS current_date FROM employee LIMIT 1;
+---------------+
| current_date  |
+---------------+
| 2014-11-15    |
+---------------+
1 row selected (0.153 seconds)
CASE for different data types: Before Hive 0.13.0, the data types after THEN or ELSE needed to be the same. Otherwise, an exception was thrown, such as The expression after ELSE should have the same type as those after THEN: "bigint" is expected but "int" is found. The workaround is to use IF. In Hive 0.13.0, this gets fixed, as shown here:
jdbc:hive2://> SELECT
.......> CASE WHEN 1 IS NULL THEN 'TRUE' ELSE 0 END
.......> AS case_result FROM employee LIMIT 1;
+--------------+
| case_result  |
+--------------+
| 0            |
+--------------+
1 row selected (0.063 seconds)
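For versions before Hive 0.13.0, one way to apply the workaround above is to force both branches to the same type explicitly, either with CAST or by writing both literals as strings. This is an illustrative sketch, not one of the book's examples:

```sql
-- Pre-0.13.0 workaround: make the THEN and ELSE branches one type
SELECT CASE WHEN 1 IS NULL THEN 'TRUE'
            ELSE CAST(0 AS STRING) END AS case_result
FROM employee LIMIT 1;
```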
Parser and search tips: The LATERAL VIEW statement is used with user-defined table-generating functions, such as EXPLODE(), to flatten a map or array type column. The explode function can be used on both ARRAY and MAP with LATERAL VIEW. If even one of the columns exploded is NULL, the whole row is filtered out, such as the row for Steven in the following example. To avoid this, LATERAL VIEW OUTER can be used since Hive 0.12.0, as follows:
-- Prepare data
jdbc:hive2://> INSERT INTO TABLE employee
.......> SELECT 'Steven' AS name, array(null) as work_place,
.......> named_struct("sex","Male","age",30) as sex_age,
.......> map("Python",90) as skills_score,
.......> map("R&D",array('Developer')) as depart_title
.......> FROM employee LIMIT 1;
No rows affected (28.187 seconds)
jdbc:hive2://> SELECT name, work_place, skills_score
.......> FROM employee;
+----------+-------------------------+-----------------------+
| name     | work_place              | skills_score          |
+----------+-------------------------+-----------------------+
| Michael  | ["Montreal","Toronto"]  | {"DB":80}             |
| Will     | ["Montreal"]            | {"Perl":85}           |
| Shelley  | ["New York"]            | {"Python":80}         |
| Lucy     | ["Vancouver"]           | {"Sales":89,"HR":94}  |
| Steven   | NULL                    | {"Python":90}         |
+----------+-------------------------+-----------------------+
5 rows selected (0.053 seconds)
-- LATERAL VIEW ignores the row when EXPLODE returns NULL
jdbc:hive2://> SELECT name, workplace, skills, score
.......> FROM employee
.......> LATERAL VIEW explode(work_place) wp AS workplace
.......> LATERAL VIEW explode(skills_score) ss
.......> AS skills, score;
+----------+------------+---------+--------+
| name     | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
+----------+------------+---------+--------+
6 rows selected (24.733 seconds)
-- LATERAL VIEW OUTER keeps the row when EXPLODE returns NULL
jdbc:hive2://> SELECT name, workplace, skills, score
.......> FROM employee
.......> LATERAL VIEW OUTER explode(work_place) wp
.......> AS workplace
.......> LATERAL VIEW explode(skills_score) ss
.......> AS skills, score;
+----------+------------+---------+--------+
| name     | workplace  | skills  | score  |
+----------+------------+---------+--------+
| Michael  | Montreal   | DB      | 80     |
| Michael  | Toronto    | DB      | 80     |
| Will     | Montreal   | Perl    | 85     |
| Shelley  | New York   | Python  | 80     |
| Lucy     | Vancouver  | Sales   | 89     |
| Lucy     | Vancouver  | HR      | 94     |
| Steven   | None       | Python  | 90     |
+----------+------------+---------+--------+
7 rows selected (24.573 seconds)
The REVERSE statement can be used to reverse the order of the letters in a string. The SPLIT statement can be used to tokenize a string using a specified separator. The following is an example of using them together to get the filename from a Linux path:
jdbc:hive2://> SELECT
.......> reverse(split(reverse('/home/user/employee.txt'),'/')[0])
.......> AS linux_file_name FROM employee LIMIT 1;
+------------------+
| linux_file_name  |
+------------------+
| employee.txt     |
+------------------+
1 row selected (0.1 seconds)
Whereas EXPLODE outputs each element in an array or map as separate rows, collect_set and collect_list do the opposite by returning a collection with elements gathered from each row. The collect_set statement removes duplicates from the result, but collect_list does not. This is shown here:
jdbc:hive2://> SELECT collect_set(work_place[0])
.......> AS flat_workplace0 FROM employee;
+--------------------------------------+
| flat_workplace0                      |
+--------------------------------------+
| ["Vancouver","Montreal","New York"]  |
+--------------------------------------+
1 row selected (43.455 seconds)
jdbc:hive2://> SELECT collect_list(work_place[0])
.......> AS flat_workplace0 FROM employee;
+------------------------------------------------+
| flat_workplace0                                |
+------------------------------------------------+
| ["Montreal","Montreal","New York","Vancouver"] |
+------------------------------------------------+
1 row selected (45.488 seconds)
Virtual columns: Virtual columns are a special function type of column in Hive. Right now, Hive offers two virtual columns: INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE. The INPUT__FILE__NAME column is the input file's name for a mapper task. The BLOCK__OFFSET__INSIDE__FILE column is the current global file position, or the current block's file offset if the file is compressed. The following are examples of using virtual columns to find out where the data is physically located in HDFS, especially for bucketed and partitioned tables:
jdbc:hive2://> SELECT INPUT__FILE__NAME,
.......> BLOCK__OFFSET__INSIDE__FILE AS OFFSIDE
.......> FROM employee_id_buckets;
+---------------------------------------------------------+----------+
| input__file__name                                       | offside  |
+---------------------------------------------------------+----------+
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 0        |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 55       |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 120      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 175      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 240      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 295      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 360      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 415      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 480      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 535      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 592      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 657      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 712      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 769      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000000_0  | 834      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 0        |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 57       |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 122      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 177      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 234      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 291      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 348      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 405      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 462      |
| hdfs://hive_warehouse_URI/employee_id_buckets/000001_0  | 517      |
+---------------------------------------------------------+----------+
25 rows selected (0.073 seconds)
jdbc:hive2://> SELECT INPUT__FILE__NAME FROM employee_partitioned;
+---------------------------------------------------------------------------+
| input__file__name                                                         |
+---------------------------------------------------------------------------+
| hdfs://warehouse_URI/employee_partitioned/year=2010/month=1/000000_0      |
| hdfs://warehouse_URI/employee_partitioned/year=2012/month=11/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2014/month=12/employee.txt |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
| hdfs://warehouse_URI/employee_partitioned/year=2015/month=01/000000_0     |
+---------------------------------------------------------------------------+
10 rows selected (0.47 seconds)
Functions not mentioned in the Hive wiki: The following are a few such functions:
-- Functions to check for null values
jdbc:hive2://> SELECT work_place, isnull(work_place) is_null,
.......> isnotnull(work_place) is_not_null FROM employee;
+-------------------------+----------+--------------+
| work_place              | is_null  | is_not_null  |
+-------------------------+----------+--------------+
| ["Montreal","Toronto"]  | false    | true         |
| ["Montreal"]            | false    | true         |
| ["New York"]            | false    | true         |
| ["Vancouver"]           | false    | true         |
| NULL                    | true     | false        |
+-------------------------+----------+--------------+
5 rows selected (0.058 seconds)
-- assert_true, throws an exception if 'condition' is not true
jdbc:hive2://> SELECT assert_true(work_place IS NULL)
.......> FROM employee;
Error: java.io.IOException:
org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE():
assertion failed. (state=,code=0)
-- elt(n, str1, str2, ...), returns the n-th string
jdbc:hive2://> SELECT elt(2, 'New York', 'Montreal', 'Toronto')
.......> FROM employee LIMIT 1;
+-----------+
| _c0       |
+-----------+
| Montreal  |
+-----------+
1 row selected (0.055 seconds)
-- Return the name of the current database, since Hive 0.13.0
jdbc:hive2://> SELECT current_database();
+----------+
| _c0      |
+----------+
| default  |
+----------+
1 row selected (0.057 seconds)
Transactions

Before Hive version 0.13.0, Hive did not support row-level transactions. As a result, there was no way to update, insert, or delete rows of data, and data overwrite could only happen on tables or partitions. This made Hive very difficult to use when dealing with concurrent read/write and data-cleaning use cases.
Since Hive version 0.13.0, Hive fully supports row-level transactions by offering full Atomicity, Consistency, Isolation, and Durability (ACID) in Hive. For now, all transactions are autocommitted, and they only support data in the Optimized Row Columnar (ORC) file format (available since Hive 0.11.0) and in bucketed tables.
The following configuration parameters must be set appropriately to turn on transaction support in Hive:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
The SHOW TRANSACTIONS command was added in Hive 0.13.0 to show the currently open and aborted transactions in the system:
jdbc:hive2://> SHOW TRANSACTIONS;
+-----------------+--------------------+-------+-----------+
| txnid           | state              | user  | host      |
+-----------------+--------------------+-------+-----------+
| Transaction ID  | Transaction State  | User  | Hostname  |
+-----------------+--------------------+-------+-----------+
1 row selected (15.209 seconds)
Since Hive 0.14.0, the INSERT ... VALUES, UPDATE, and DELETE commands have been added to operate on rows, with the following syntax:

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)]
VALUES values_row [, values_row …];

UPDATE tablename SET column = value [, column = value …] [WHERE expression]

DELETE FROM tablename [WHERE expression]
Summary

In this chapter, we covered how to exchange data between Hive and files using the LOAD, INSERT, IMPORT, and EXPORT keywords. Then, we introduced the different Hive ordering and sorting options. We also covered some commonly used tips for Hive functions. Finally, we provided an overview of the row-level transactions newly supported since Hive 0.13.0. After going through this chapter, we should be able to import or export data to Hive, and we should be experienced in using the different types of ordering and sorting keywords, Hive functions, and transactions.

In the next chapter, we'll look at the different ways of carrying out data aggregation and sampling in Hive.
Chapter 6. Data Aggregation and Sampling

This chapter is about how to aggregate and sample data in Hive. It first covers the usage of several aggregation functions, analytic functions working with GROUP BY and PARTITION BY, and windowing clauses. Then, it introduces the different ways of sampling data in Hive.

In this chapter, we will cover the following topics:

Basic aggregation
Advanced aggregation
Aggregation condition
Analytic functions
Sampling
Basic aggregation – GROUP BY

Data aggregation is any process of gathering and expressing data in a summary form to get more information about particular groups based on specific conditions. Hive offers several built-in aggregate functions, such as MAX, MIN, AVG, and so on. Hive also supports advanced aggregation using GROUPING SETS, ROLLUP, CUBE, analytic functions, and windowing.

The Hive basic built-in aggregate functions are usually used with the GROUP BY clause. If no GROUP BY clause is specified, they aggregate over the whole table by default. Besides aggregate functions, all other columns that are selected must also be included in the GROUP BY clause. The following are a few examples using the built-in aggregate functions:

Aggregation without GROUP BY columns:
jdbc:hive2://> SELECT count(*) AS row_cnt FROM employee;
+----------+
| row_cnt  |
+----------+
| 5        |
+----------+
1 row selected (60.709 seconds)
Aggregation with GROUP BY columns:

jdbc:hive2://> SELECT sex_age.sex, count(*) AS row_cnt
.......> FROM employee
.......> GROUP BY sex_age.sex;
+--------------+----------+
| sex_age.sex  | row_cnt  |
+--------------+----------+
| Female       | 2        |
| Male         | 3        |
+--------------+----------+
2 rows selected (100.565 seconds)

-- The column name selected is not a GROUP BY column
jdbc:hive2://> SELECT name, sex_age.sex, count(*) AS row_cnt
.......> FROM employee GROUP BY sex_age.sex;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10025]: Line 1:7 Expression not in GROUP BY key 'name'
(state=42000,code=10025)
If we have to select columns that are not GROUP BY columns, one way is to use analytic functions, which are introduced later, to completely avoid using the GROUP BY clause. The other way is to use the collect_set function, which returns a set of objects with duplicate elements eliminated, as follows:

-- Find row count by sex and a sampled age for each sex
jdbc:hive2://> SELECT sex_age.sex,
.......> collect_set(sex_age.age)[0] AS random_age,
.......> count(*) AS row_cnt
.......> FROM employee GROUP BY sex_age.sex;
+--------------+-------------+----------+
| sex_age.sex  | random_age  | row_cnt  |
+--------------+-------------+----------+
| Female       | 27          | 2        |
| Male         | 35          | 3        |
+--------------+-------------+----------+
2 rows selected (48.15 seconds)
An aggregate function can be used with other aggregate functions in the same SELECT statement. It can also be nested with other functions, such as conditional functions. However, nested aggregate functions are not supported. See the following examples for more details:

Multiple aggregate functions called in the same SELECT statement:

jdbc:hive2://> SELECT sex_age.sex, AVG(sex_age.age) AS avg_age,
.......> count(*) AS row_cnt
.......> FROM employee GROUP BY sex_age.sex;
+--------------+---------------------+----------+
| sex_age.sex  |       avg_age       | row_cnt  |
+--------------+---------------------+----------+
| Female       | 42.0                | 2        |
| Male         | 31.666666666666668  | 3        |
+--------------+---------------------+----------+
2 rows selected (98.857 seconds)
Aggregate functions used with CASE WHEN, as follows:

jdbc:hive2://> SELECT sum(CASE WHEN sex_age.sex = 'Male'
.......> THEN sex_age.age ELSE 0 END)/
.......> count(CASE WHEN sex_age.sex = 'Male' THEN 1
.......> ELSE NULL END) AS male_age_avg FROM employee;
+---------------------+
|    male_age_avg     |
+---------------------+
| 31.666666666666668  |
+---------------------+
1 row selected (38.415 seconds)
Aggregate functions used with COALESCE and IF, as follows:

jdbc:hive2://> SELECT
.......> sum(coalesce(sex_age.age, 0)) AS age_sum,
.......> sum(if(sex_age.sex = 'Female', sex_age.age, 0))
.......> AS female_age_sum FROM employee;
+----------+-----------------+
| age_sum  | female_age_sum  |
+----------+-----------------+
| 179      | 84              |
+----------+-----------------+
1 row selected (42.137 seconds)
Nested aggregate functions are not allowed, as shown here:

jdbc:hive2://> SELECT avg(count(*)) AS row_cnt
.......> FROM employee;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10128]: Line 1:11 Not yet supported place for UDAF 'count'
(state=42000,code=10128)
Aggregate functions can also be used with the DISTINCT keyword to aggregate on unique values:

jdbc:hive2://> SELECT count(DISTINCT sex_age.sex) AS sex_uni_cnt,
.......> count(DISTINCT name) AS name_uni_cnt
.......> FROM employee;
+--------------+---------------+
| sex_uni_cnt  | name_uni_cnt  |
+--------------+---------------+
| 2            | 5             |
+--------------+---------------+
1 row selected (35.935 seconds)
Note

When we use COUNT and DISTINCT together, Hive always ignores the setting for the number of reducers used (such as mapred.reduce.tasks=20) and uses only one reducer. In this case, the single reducer becomes the bottleneck when processing big volumes of data. The workaround is to use a subquery, as follows:

-- Triggers a single reducer during the whole processing
SELECT count(DISTINCT sex_age.sex) AS sex_uni_cnt FROM employee;

-- Use a subquery to select unique values before aggregating, for better performance
SELECT count(*) AS sex_uni_cnt
FROM (SELECT DISTINCT sex_age.sex FROM employee) a;

In this case, the first stage of the query, implementing DISTINCT, can use more than one reducer. In the second stage, the mapper has less output just for the COUNT purpose, since the data is already unique after implementing DISTINCT. As a result, the reducer will not be overloaded.
We may encounter special behavior when Hive aggregates across columns containing NULL values: an entire row is ignored if any column used in the aggregated expression has a NULL value, as in the second row of the following example. To avoid this, we can use COALESCE to assign a default value when the column value is NULL. This can be done as follows:
-- Create a table t for testing
jdbc:hive2://> CREATE TABLE t AS SELECT * FROM
.......> (SELECT employee_id - 99 AS val1,
.......> (employee_id - 98) AS val2 FROM employee_hr
.......> WHERE employee_id <= 101
.......> UNION ALL
.......> SELECT null val1, 2 AS val2 FROM employee_hr
.......> WHERE employee_id = 100) a;
No rows affected (0.138 seconds)

-- Check the rows in the table created
jdbc:hive2://> SELECT * FROM t;
+---------+---------+
| t.val1  | t.val2  |
+---------+---------+
| 1       | 2       |
| NULL    | 2       |
| 2       | 3       |
+---------+---------+
3 rows selected (0.069 seconds)
-- The 2nd row (NULL, 2) is ignored when doing sum(val1 + val2)
jdbc:hive2://> SELECT sum(val1), sum(val1 + val2)
.......> FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 8    |
+------+------+
1 row selected (57.775 seconds)

jdbc:hive2://> SELECT sum(coalesce(val1, 0)),
.......> sum(coalesce(val1, 0) + val2) FROM t;
+------+------+
| _c0  | _c1  |
+------+------+
| 3    | 10   |
+------+------+
1 row selected (69.967 seconds)
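The NULL-propagation behavior above follows standard SQL three-valued logic: val1 + val2 is NULL whenever val1 is NULL, and SUM simply skips NULL inputs. A minimal Python sketch of the two queries (an illustrative model of the semantics, not Hive code, with None standing in for NULL):

```python
# Rows mirror the table t above: (val1, val2), with None for NULL.
rows = [(1, 2), (None, 2), (2, 3)]

def coalesce(value, default):
    """Return value unless it is None (NULL); otherwise return default."""
    return default if value is None else value

# SQL SUM skips NULL inputs; val1 + val2 is NULL whenever val1 is NULL.
sum_val1 = sum(v1 for v1, _ in rows if v1 is not None)        # 3
sum_expr = sum(v1 + v2 for v1, v2 in rows if v1 is not None)  # 8: row (NULL, 2) ignored
sum_fixed = sum(coalesce(v1, 0) + v2 for v1, v2 in rows)      # 10: NULL treated as 0

print(sum_val1, sum_expr, sum_fixed)
```

The coalesce variant recovers the row that plain addition discards, matching the 8 versus 10 difference in the query output above.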
The hive.map.aggr property controls aggregation in the map task. The default value for this setting is false. If it is set to true, Hive will do the first-level aggregation directly in the map task for better performance, at the cost of more memory:

jdbc:hive2://> SET hive.map.aggr=true;
No rows affected (0.002 seconds)
Advanced aggregation – GROUPING SETS

Hive offers the GROUPING SETS keyword to implement multiple advanced GROUP BY operations against the same set of data. In fact, GROUPING SETS is a shorthand way of connecting several GROUP BY result sets with UNION ALL. The GROUPING SETS keyword completes all processing in a single stage of jobs, which is more efficient than GROUP BY and UNION ALL, which take multiple stages. An empty set () in the GROUPING SETS clause calculates the overall aggregation. The following are a few examples showing the equivalences of GROUPING SETS. For better understanding, we can say that the outer level of GROUPING SETS defines on what data UNION ALL is to be implemented, while the inner level defines on what data GROUP BY is to be implemented in each UNION ALL.
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS ((name, work_place[0]));
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0];

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS (name, work_place[0]);
||
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY work_place[0];

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS ((name, work_place[0]), name);
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name;

SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
GROUPING SETS ((name, work_place[0]), name, work_place[0], ());
||
SELECT name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name, work_place[0]
UNION ALL
SELECT name, NULL AS main_place, count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY name
UNION ALL
SELECT NULL AS name, work_place[0] AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id
GROUP BY work_place[0]
UNION ALL
SELECT NULL AS name, NULL AS main_place,
count(employee_id) AS emp_id_cnt
FROM employee_id;
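The equivalences above can be checked mechanically. The following Python sketch (illustrative data and helper names, not from the book) shows that a GROUPING SETS result is exactly the union of the per-set GROUP BY results, with None standing in for the NULL that fills columns a set does not group on:

```python
from collections import Counter

# Illustrative rows: (name, main_place); not the book's employee data.
rows = [("Lucy", "NY"), ("Lucy", "NY"), ("Will", "LA"), ("Will", "NY")]

def group_count(rows, keys):
    """GROUP BY the given column positions; ungrouped columns become None (NULL)."""
    counts = Counter(tuple(row[i] for i in keys) for row in rows)
    result = set()
    for grouped, cnt in counts.items():
        full = [None] * len(rows[0])
        for pos, i in enumerate(keys):
            full[i] = grouped[pos]
        result.add((tuple(full), cnt))
    return result

def grouping_sets_count(rows, key_sets):
    """GROUPING SETS: aggregate the same input once per grouping set."""
    result = set()
    for keys in key_sets:
        result |= group_count(rows, keys)
    return result

# GROUPING SETS ((name, place), name) == GROUP BY name, place UNION ALL GROUP BY name
left = grouping_sets_count(rows, [(0, 1), (0,)])
right = group_count(rows, (0, 1)) | group_count(rows, (0,))
print(left == right)
```

The difference in Hive is only in execution: GROUPING SETS produces this combined result in one stage instead of running each GROUP BY separately and unioning the outputs.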
However, the GROUPING SETS operation still has unresolved issues when working with columns referred to through a table or struct-type alias (see Apache JIRA HIVE-6950 at https://issues.apache.org/jira/browse/HIVE-6950). This is shown here:
jdbc:hive2://> SELECT sex_age.sex, sex_age.age,
.......> count(name) AS name_cnt
.......> FROM employee
.......> GROUP BY sex_age.sex, sex_age.age
.......> GROUPING SETS ((sex_age.sex, sex_age.age));
Error: Error while compiling statement: FAILED: ParseException line 1:131
missing ) at ',' near '<EOF>'
line 1:145 extraneous input ')' expecting EOF near '<EOF>'
(state=42000,code=40000)
Advanced aggregation – ROLLUP and CUBE

The ROLLUP statement enables a SELECT statement to calculate multiple levels of aggregation across a specified group of dimensions. The ROLLUP statement is a simple extension of the GROUP BY clause with high efficiency and minimal overhead for a query. Compared to GROUPING SETS, which creates specified levels of aggregation, ROLLUP creates n+1 levels of aggregation, where n is the number of grouping columns. First, it calculates the standard aggregate values specified in the GROUP BY clause. Then, it creates higher-level subtotals, moving from right to left through the list of grouping columns, as shown in the following example:

GROUP BY a, b, c WITH ROLLUP

This is equivalent to the following:

GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ())
The CUBE statement takes a specified set of grouping columns and creates aggregations for all of their possible combinations. If n columns are specified for CUBE, there will be 2^n combinations of aggregations returned, as shown in the following example:

GROUP BY a, b, c WITH CUBE

This is equivalent to the following:

GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ())
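The two expansions can be generated mechanically. The following Python sketch (hypothetical helper names, not Hive code) enumerates the grouping sets that ROLLUP and CUBE produce for a list of columns:

```python
from itertools import combinations

def rollup_sets(cols):
    """ROLLUP over n columns yields n+1 sets: every left-to-right prefix, down to ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """CUBE over n columns yields all 2^n subsets of the columns."""
    return [s for r in range(len(cols), -1, -1) for s in combinations(cols, r)]

print(rollup_sets(["a", "b", "c"]))
# [('a', 'b', 'c'), ('a', 'b'), ('a',), ()]
print(cube_sets(["a", "b", "c"]))
# [('a', 'b', 'c'), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('a',), ('b',), ('c',), ()]
```

Note the asymmetry: rollup_sets only drops columns from the right (subtotals respect the dimension hierarchy), while cube_sets covers every combination.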
The GROUPING__ID function works as an extension to distinguish entire rows from each other. It accepts one or more columns and returns the decimal equivalent of the BIT vector for the columns specified after GROUP BY. The returned decimal number is converted from a binary of 1s and 0s, where each bit represents whether the corresponding column is aggregated (its value is not NULL) in the row. The bit order starts counting from the column nearest to GROUP BY. In the following example, the first column is start_date:
jdbc:hive2://> SELECT GROUPING__ID,
.......> BIN(CAST(GROUPING__ID AS BIGINT)) AS bit_vector,
.......> name, start_date, count(employee_id) emp_id_cnt
.......> FROM employee_hr
.......> GROUP BY start_date, name
.......> WITH CUBE ORDER BY start_date;
+---------------+-------------+----------+-------------+-------------+
| grouping__id  | bit_vector  |   name   | start_date  | emp_id_cnt  |
+---------------+-------------+----------+-------------+-------------+
| 2             | 10          | Steven   | NULL        | 1           |
| 2             | 10          | Michael  | NULL        | 1           |
| 2             | 10          | Lucy     | NULL        | 1           |
| 0             | 0           | NULL     | NULL        | 4           |
| 2             | 10          | Will     | NULL        | 1           |
| 3             | 11          | Lucy     | 2010-01-03  | 1           |
| 1             | 1           | NULL     | 2010-01-03  | 1           |
| 1             | 1           | NULL     | 2012-11-03  | 1           |
| 3             | 11          | Steven   | 2012-11-03  | 1           |
| 1             | 1           | NULL     | 2013-10-02  | 1           |
| 3             | 11          | Will     | 2013-10-02  | 1           |
| 1             | 1           | NULL     | 2014-01-29  | 1           |
| 3             | 11          | Michael  | 2014-01-29  | 1           |
+---------------+-------------+----------+-------------+-------------+
13 rows selected (136.708 seconds)
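The bit vector in this output can be reproduced with a few lines of Python. This sketch follows the encoding shown above, where bit 0 is the column nearest to GROUP BY and a bit is 1 when the column participates in the grouping set; note that later Hive releases changed the GROUPING__ID encoding, so treat this as a model of the output shown here rather than of every Hive version:

```python
def grouping_id(groupby_cols, grouped):
    """Decimal GROUPING__ID: bit i (LSB = i-th GROUP BY column) is 1 when
    that column is part of the grouping set for the row."""
    gid = 0
    for i, col in enumerate(groupby_cols):
        if col in grouped:
            gid |= 1 << i
    return gid

cols = ["start_date", "name"]
# Matches the rows in the output above:
print(grouping_id(cols, {"start_date", "name"}))  # 3 -> bin 11
print(grouping_id(cols, {"name"}))                # 2 -> bin 10
print(grouping_id(cols, {"start_date"}))          # 1 -> bin 1
print(grouping_id(cols, set()))                   # 0 -> grand total row
```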
Aggregation condition – HAVING

Since Hive 0.7.0, HAVING has been supported for conditional filtering of GROUP BY results. By using HAVING, we can avoid using a subquery after GROUP BY. The following is an example:
jdbc:hive2://> SELECT sex_age.age FROM employee
.......> GROUP BY sex_age.age HAVING count(*) <= 1;
+--------------+
| sex_age.age  |
+--------------+
| 57           |
| 27           |
| 35           |
+--------------+
3 rows selected (74.376 seconds)
If we do not use HAVING, we can use a subquery instead, as follows:
jdbc:hive2://> SELECT a.age
.......> FROM
.......> (SELECT count(*) AS cnt, sex_age.age
.......> FROM employee GROUP BY sex_age.age
.......> ) a WHERE a.cnt <= 1;
+--------+
| a.age  |
+--------+
| 57     |
| 27     |
| 35     |
+--------+
3 rows selected (87.298 seconds)
Analytic functions

Analytic functions, available since Hive 0.11.0, are a special group of functions that scan multiple input rows to compute each output value. Analytic functions are usually used with OVER, PARTITION BY, ORDER BY, and a windowing specification. Different from the regular aggregate functions used with the GROUP BY clause, which are limited to one result value per group, analytic functions operate on windows where the input rows are ordered and grouped using flexible conditions expressed through an OVER ... PARTITION BY clause. Though analytic functions give aggregate results, they do not group the result set; they return the group value multiple times, once with each record. Analytic functions offer more flexibility and functionality than the regular GROUP BY clause and make special aggregations in Hive easier and more powerful. The syntax for an analytic function is as follows:

Function (arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>]
[<window_clause>])
Function (arg1, ..., argn) can be any function in the following list:

Standard aggregations: This can be COUNT(), SUM(), MIN(), MAX(), or AVG().
RANK: It ranks items in a group, such as finding the top N rows for specific conditions.
DENSE_RANK: It is similar to RANK, but leaves no gaps in the ranking sequence when there are ties. For example, if we rank a match using DENSE_RANK and have two players tied for second place, both players are in second place and the next person is ranked third. The RANK function would also rank the two players in second place, but the next person would be in fourth place.
ROW_NUMBER: It assigns a unique sequence number, starting from 1, to each row according to the partition and order specification.
CUME_DIST: It computes the number of rows whose value is smaller than or equal to the value of the current row, divided by the total number of rows.
PERCENT_RANK: It is similar to CUME_DIST, but uses rank values rather than row counts: (current rank - 1) divided by (total number of rows - 1). Therefore, it returns the percent rank of a value relative to a group of values.
NTILE: It divides an ordered dataset into the specified number of buckets and assigns the appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.
LEAD: The LEAD function, lead(value_expr[, offset[, default]]), is used to return data from the following rows. The number of rows to lead (offset) can optionally be specified; if it is not, the lead is one row by default. It returns default, or NULL when no default is specified, if the lead for the current row extends beyond the end of the window.
LAG: The LAG function, lag(value_expr[, offset[, default]]), is used to access data from previous rows. The number of rows to lag (offset) can optionally be specified; if it is not, the lag is one row by default. It returns default, or NULL when no default is specified, if the lag for the current row extends beyond the beginning of the window.
FIRST_VALUE: It returns the first result from an ordered set.
LAST_VALUE: It returns the last result from an ordered set. With the default windowing clause, the result of LAST_VALUE can be a little unexpected. This is because the default windowing clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which means the current row is always the last value. Changing the windowing clause to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING gives us the result we probably expect (see the last_value column in the following examples).
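The difference between RANK, DENSE_RANK, and ROW_NUMBER is easiest to see side by side. The following Python sketch (a hypothetical model of the semantics, not Hive code) computes all three over one ordered partition, using the dept_num 1000 salaries that appear in the examples below:

```python
def rank_functions(values):
    """Compute RANK, DENSE_RANK, and ROW_NUMBER over one ordered partition.
    values must already be sorted in the window's ORDER BY order."""
    rank, dense, rows = [], [], []
    for i, v in enumerate(values):
        rows.append(i + 1)                               # ROW_NUMBER: always unique
        if i > 0 and v == values[i - 1]:
            rank.append(rank[-1])                        # tie: same RANK ...
            dense.append(dense[-1])                      # ... and same DENSE_RANK
        else:
            rank.append(i + 1)                           # RANK leaves gaps after ties
            dense.append(dense[-1] + 1 if dense else 1)  # DENSE_RANK does not
    return rank, dense, rows

# Salaries of dept 1000 ordered ascending, with a tie at 4000.
r, d, n = rank_functions([4000, 4000, 5000, 5500, 6400])
print(r)  # [1, 1, 3, 4, 5] - RANK skips 2 after the tie
print(d)  # [1, 1, 2, 3, 4] - DENSE_RANK does not skip
print(n)  # [1, 2, 3, 4, 5]
```

These are the same rank and dense_rank columns the Hive query over employee_contract produces for dept_num 1000 later in this section.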
The [PARTITION BY <...>] statement is similar to the GROUP BY clause. It divides the rows into groups containing identical values in one or more partition-by columns. These logical groups are known as partitions, which is not the same term as used for partitioned tables. Omitting the PARTITION BY statement applies the analytic operation to all the rows in the table.

The [ORDER BY <....>] clause works like the regular ORDER BY expr [ASC|DESC] clause. It makes sure the rows produced by the PARTITION BY clause are ordered by the specification, such as in ascending or descending order. Right now, Hive supports only one ORDER BY column in this case; otherwise, it throws a semantic exception (see Apache JIRA HIVE-4662 at https://issues.apache.org/jira/browse/HIVE-4662). The workaround is to use the ROWS UNBOUNDED PRECEDING windowing clause (see the runningTotal2 column in the following examples):
Prepare the table and data for demonstration:

jdbc:hive2://> CREATE TABLE IF NOT EXISTS employee_contract
.......> (
.......> name string,
.......> dept_num int,
.......> employee_id int,
.......> salary int,
.......> type string,
.......> start_date date
.......> )
.......> ROW FORMAT DELIMITED
.......> FIELDS TERMINATED BY '|'
.......> STORED AS TEXTFILE;
No rows affected (0.282 seconds)

jdbc:hive2://> LOAD DATA LOCAL INPATH
.......> '/home/dayongd/Downloads/employee_contract.txt'
.......> OVERWRITE INTO TABLE employee_contract;
No rows affected (0.48 seconds)
The regular aggregations used as analytic functions, as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
.......> COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
.......> SUM(salary) OVER (PARTITION BY dept_num
.......> ORDER BY dept_num) AS deptTotal,
.......> SUM(salary) OVER (ORDER BY dept_num)
.......> AS runningTotal1, SUM(salary)
.......> OVER (ORDER BY dept_num, name ROWS UNBOUNDED
.......> PRECEDING) AS runningTotal2
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
+---------+----------+--------+---------+-----------+---------------+---------------+
|  name   | dept_num | salary | row_cnt | deptTotal | runningTotal1 | runningTotal2 |
+---------+----------+--------+---------+-----------+---------------+---------------+
| Lucy    | 1000     | 5500   | 5       | 24900     | 24900         | 5500          |
| Michael | 1000     | 5000   | 5       | 24900     | 24900         | 10500         |
| Steven  | 1000     | 6400   | 5       | 24900     | 24900         | 16900         |
| Will    | 1000     | 4000   | 5       | 24900     | 24900         | 24900         |
| Will    | 1000     | 4000   | 5       | 24900     | 24900         | 20900         |
| Jess    | 1001     | 6000   | 3       | 17400     | 42300         | 30900         |
| Lily    | 1001     | 5000   | 3       | 17400     | 42300         | 35900         |
| Mike    | 1001     | 6400   | 3       | 17400     | 42300         | 42300         |
| Richard | 1002     | 8000   | 3       | 20500     | 62800         | 50300         |
| Wei     | 1002     | 7000   | 3       | 20500     | 62800         | 57300         |
| Yun     | 1002     | 5500   | 3       | 20500     | 62800         | 62800         |
+---------+----------+--------+---------+-----------+---------------+---------------+
11 rows selected (359.918 seconds)
Other analytic functions used as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
.......> RANK() OVER (PARTITION BY dept_num ORDER BY salary)
.......> AS rank,
.......> DENSE_RANK()
.......> OVER (PARTITION BY dept_num ORDER BY salary)
.......> AS dense_rank, ROW_NUMBER() OVER () AS row_num,
.......> ROUND((CUME_DIST() OVER (PARTITION BY dept_num
.......> ORDER BY salary)), 1) AS cume_dist,
.......> PERCENT_RANK() OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS percent_rank, NTILE(4)
.......> OVER (PARTITION BY dept_num ORDER BY salary)
.......> AS ntile
.......> FROM employee_contract ORDER BY dept_num;
+---------+----------+--------+------+------------+---------+-----------+--------------+-------+
|  name   | dept_num | salary | rank | dense_rank | row_num | cume_dist | percent_rank | ntile |
+---------+----------+--------+------+------------+---------+-----------+--------------+-------+
| Will    | 1000     | 4000   | 1    | 1          | 11      | 0.4       | 0.0          | 1     |
| Will    | 1000     | 4000   | 1    | 1          | 10      | 0.4       | 0.0          | 1     |
| Michael | 1000     | 5000   | 3    | 2          | 9       | 0.6       | 0.5          | 2     |
| Lucy    | 1000     | 5500   | 4    | 3          | 8       | 0.8       | 0.75         | 3     |
| Steven  | 1000     | 6400   | 5    | 4          | 7       | 1.0       | 1.0          | 4     |
| Lily    | 1001     | 5000   | 1    | 1          | 6       | 0.3       | 0.0          | 1     |
| Jess    | 1001     | 6000   | 2    | 2          | 5       | 0.7       | 0.5          | 2     |
| Mike    | 1001     | 6400   | 3    | 3          | 4       | 1.0       | 1.0          | 3     |
| Yun     | 1002     | 5500   | 1    | 1          | 3       | 0.3       | 0.0          | 1     |
| Wei     | 1002     | 7000   | 2    | 2          | 2       | 0.7       | 0.5          | 2     |
| Richard | 1002     | 8000   | 3    | 3          | 1       | 1.0       | 1.0          | 3     |
+---------+----------+--------+------+------------+---------+-----------+--------------+-------+
11 rows selected (367.112 seconds)
jdbc:hive2://> SELECT name, dept_num, salary,
.......> LEAD(salary, 2) OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS lead,
.......> LAG(salary, 2, 0) OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS lag,
.......> FIRST_VALUE(salary) OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS first_value,
.......> LAST_VALUE(salary) OVER (PARTITION BY dept_num
.......> ORDER BY salary) AS last_value_default,
.......> LAST_VALUE(salary) OVER (PARTITION BY dept_num
.......> ORDER BY salary
.......> RANGE BETWEEN UNBOUNDED PRECEDING
.......> AND UNBOUNDED FOLLOWING) AS last_value
.......> FROM employee_contract ORDER BY dept_num;
+---------+----------+--------+------+------+-------------+--------------------+------------+
|  name   | dept_num | salary | lead | lag  | first_value | last_value_default | last_value |
+---------+----------+--------+------+------+-------------+--------------------+------------+
| Will    | 1000     | 4000   | 5000 | 0    | 4000        | 4000               | 6400       |
| Will    | 1000     | 4000   | 5500 | 0    | 4000        | 4000               | 6400       |
| Michael | 1000     | 5000   | 6400 | 4000 | 4000        | 5000               | 6400       |
| Lucy    | 1000     | 5500   | NULL | 4000 | 4000        | 5500               | 6400       |
| Steven  | 1000     | 6400   | NULL | 5000 | 4000        | 6400               | 6400       |
| Lily    | 1001     | 5000   | 6400 | 0    | 5000        | 5000               | 6400       |
| Jess    | 1001     | 6000   | NULL | 0    | 5000        | 6000               | 6400       |
| Mike    | 1001     | 6400   | NULL | 5000 | 5000        | 6400               | 6400       |
| Yun     | 1002     | 5500   | 8000 | 0    | 5500        | 5500               | 8000       |
| Wei     | 1002     | 7000   | NULL | 0    | 5500        | 7000               | 8000       |
| Richard | 1002     | 8000   | NULL | 5500 | 5500        | 8000               | 8000       |
+---------+----------+--------+------+------+-------------+--------------------+------------+
11 rows selected (92.572 seconds)
The [<window_clause>] clause is used to further sub-partition the result and apply the analytic function. There are two types of windows: the row type window and the range type window.

Note

According to https://issues.apache.org/jira/browse/HIVE-4797, the RANK, NTILE, DENSE_RANK, CUME_DIST, PERCENT_RANK, LEAD, LAG, and ROW_NUMBER functions do not yet support being used with a window clause.
For row type windows, the definition is in terms of row numbers before or after the current row. The general syntax of the row window clause is as follows:

ROWS BETWEEN <start_expr> AND <end_expr>

The <start_expr> can be any one of the following:

UNBOUNDED PRECEDING
CURRENT ROW
N PRECEDING or FOLLOWING

The <end_expr> can be any one of the following:

UNBOUNDED FOLLOWING
CURRENT ROW
N PRECEDING or FOLLOWING
The following are the window expressions:

BETWEEN ... AND: Use the BETWEEN ... AND clause to specify the start point and end point of the window. The first expression (before AND) defines the start point and the second expression (after AND) defines the end point. If we omit BETWEEN ... AND (such as ROWS N PRECEDING or ROWS UNBOUNDED PRECEDING), Hive considers it the start point, and the end point defaults to the current row (see the win13 column in the upcoming examples).
N PRECEDING or FOLLOWING: This indicates N rows before or after the current row.
UNBOUNDED PRECEDING: This indicates that the window starts at the first row of the partition. This is a start point specification and cannot be used as an end point specification.
UNBOUNDED FOLLOWING: This indicates that the window ends at the last row of the partition. This is an end point specification and cannot be used as a start point specification.
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: This indicates the first and last row for every row, meaning all rows in the partition (see the win12 column in the upcoming examples).
CURRENT ROW: As a start point, CURRENT ROW specifies that the window begins at the current row or value, depending on whether we have specified ROWS or RANGE (RANGE is introduced later in this chapter). In this case, the end point cannot be N PRECEDING. As an end point, CURRENT ROW specifies that the window ends at the current row or value, depending on whether we have specified ROWS or RANGE. In this case, the start point cannot be N FOLLOWING.
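The frame boundaries above can be modeled directly. This Python sketch (a hypothetical helper, using None both for UNBOUNDED and for NULL) evaluates MAX over a ROWS frame per row; with the dept_num 1001 salaries ordered by name (Jess, Lily, Mike), it reproduces the win1, win4, and win12 columns of the example that follows:

```python
def frame_max(values, i, start, end):
    """MAX over a ROWS frame for the row at index i.
    start/end are offsets from the current row: negative = PRECEDING,
    positive = FOLLOWING, None = UNBOUNDED in that direction.
    Returns None (NULL) when the frame is empty, like win4 below."""
    lo = 0 if start is None else i + start
    hi = len(values) - 1 if end is None else i + end
    lo, hi = max(lo, 0), min(hi, len(values) - 1)
    if hi < lo:
        return None  # empty frame -> NULL
    return max(values[lo:hi + 1])

# Dept 1001 salaries ordered by name (Jess, Lily, Mike), as in the book's data.
sal = [6000, 5000, 6400]
win1 = [frame_max(sal, i, -2, 0) for i in range(len(sal))]        # 2 PRECEDING AND CURRENT ROW
win4 = [frame_max(sal, i, -1, -2) for i in range(len(sal))]       # 1 PRECEDING AND 2 PRECEDING (empty)
win12 = [frame_max(sal, i, None, None) for i in range(len(sal))]  # UNBOUNDED both directions
print(win1, win4, win12)  # [6000, 6000, 6400] [None, None, None] [6400, 6400, 6400]
```

The empty win4 frame illustrates why that column is all NULL in the query output: the start of the frame lies after its end.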
The following diagram can help us understand the preceding definitions more clearly:

Window expression definition

The following examples implement the window expressions:
jdbc:hive2://> SELECT name, dept_num AS dept, salary AS sal,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 1 PRECEDING AND 2 PRECEDING) win4,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN CURRENT ROW AND CURRENT ROW) win7,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN CURRENT ROW AND 1 FOLLOWING) win8,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win9,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win10,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win11,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
.......> FOLLOWING) win12,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS 2 PRECEDING) win13
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
+---------+------+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+
|  name   | dept | sal  | win1 | win2 | win3 | win4 | win5 | win7 | win8 | win9 | win10 | win11 | win12 | win13 |
+---------+------+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+
| Lucy    | 1000 | 5500 | 5500 | 6400 | 6400 | NULL | 6400 | 5500 | 5500 | 6400 | 5500  | 5500  | 6400  | 5500  |
| Michael | 1000 | 5000 | 5500 | 6400 | 6400 | NULL | 6400 | 5000 | 6400 | 6400 | 5500  | 6400  | 6400  | 5500  |
| Steven  | 1000 | 6400 | 6400 | 6400 | 6400 | NULL | 4000 | 6400 | 6400 | 6400 | 6400  | 6400  | 6400  | 6400  |
| Will    | 1000 | 4000 | 6400 | 6400 | 4000 | NULL | NULL | 4000 | 4000 | 4000 | 6400  | 6400  | 6400  | 6400  |
| Will    | 1000 | 4000 | 6400 | 6400 | 6400 | NULL | 4000 | 4000 | 4000 | 4000 | 6400  | 6400  | 6400  | 6400  |
| Jess    | 1001 | 6000 | 6000 | 6400 | 6400 | NULL | 6400 | 6000 | 6000 | 6400 | 6000  | 6000  | 6400  | 6000  |
| Lily    | 1001 | 5000 | 6000 | 6400 | 6400 | NULL | 6400 | 5000 | 6400 | 6400 | 6000  | 6400  | 6400  | 6000  |
| Mike    | 1001 | 6400 | 6400 | 6400 | 6400 | NULL | NULL | 6400 | 6400 | 6400 | 6400  | 6400  | 6400  | 6400  |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 | NULL | 7000 | 8000 | 8000 | 8000 | 8000  | 8000  | 8000  | 8000  |
| Wei     | 1002 | 7000 | 8000 | 8000 | 8000 | NULL | 5500 | 7000 | 7000 | 7000 | 8000  | 8000  | 8000  | 8000  |
| Yun     | 1002 | 5500 | 8000 | 8000 | 7000 | NULL | NULL | 5500 | 5500 | 5500 | 8000  | 8000  | 8000  | 8000  |
+---------+------+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+
11 rows selected (168.732 seconds)
From the preceding example, we can see that the win4 column is NULL. This is because the row specified by <start_expr> must come before the row specified by <end_expr>. However, if we try to fix this by reordering the boundaries, especially when using the PRECEDING keyword, Hive reports the following exceptions; the same applies to UNBOUNDED PRECEDING. This is a current issue (https://issues.apache.org/jira/browse/HIVE-9412) with Hive windowing:
jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN 2 PRECEDING AND 1 PRECEDING) win4_alter
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
Error: Error while compiling statement: FAILED: SemanticException Failed to
breakup Windowing invocations into Groups. At least 1 group must only
depend on input columns. Also check for circular dependencies.
Underlying error: Window range invalid, start boundary is greater than end
boundary: window(start=range(2 PRECEDING), end=range(1 PRECEDING))
(state=42000,code=40000)

jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> name ROWS
.......> BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) win1
.......> FROM employee_contract
.......> ORDER BY dept_num, name;
Error: Error while compiling statement: FAILED: SemanticException End of a
WindowFrame cannot be UNBOUNDED PRECEDING (state=42000,code=40000)
In addition, windows can be defined in a separate WINDOW clause or refer to other windows, as follows:

jdbc:hive2://> SELECT name, dept_num, salary,
.......> MAX(salary) OVER w1 AS win1,
.......> MAX(salary) OVER w1 AS win2,
.......> MAX(salary) OVER w1 AS win3
.......> FROM employee_contract
.......> ORDER BY dept_num, name
.......> WINDOW
.......> w1 AS (PARTITION BY dept_num ORDER BY name
.......> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),
.......> w2 AS w3,
.......> w3 AS (PARTITION BY dept_num ORDER BY name
.......> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING);
+----------+-----------+---------+-------+-------+-------+
|   name   | dept_num  | salary  | win1  | win2  | win3  |
+----------+-----------+---------+-------+-------+-------+
| Lucy     | 1000      | 5500    | 5500  | 5500  | 5500  |
| Michael  | 1000      | 5000    | 5500  | 5500  | 5500  |
| Steven   | 1000      | 6400    | 6400  | 6400  | 6400  |
| Will     | 1000      | 4000    | 6400  | 6400  | 6400  |
| Will     | 1000      | 4000    | 6400  | 6400  | 6400  |
| Jess     | 1001      | 6000    | 6000  | 6000  | 6000  |
| Lily     | 1001      | 5000    | 6000  | 6000  | 6000  |
| Mike     | 1001      | 6400    | 6400  | 6400  | 6400  |
| Richard  | 1002      | 8000    | 8000  | 8000  | 8000  |
| Wei      | 1002      | 7000    | 8000  | 8000  | 8000  |
| Yun      | 1002      | 5500    | 8000  | 8000  | 8000  |
+----------+-----------+---------+-------+-------+-------+
11 rows selected (156.902 seconds)
Compared to row type windows, which are expressed in terms of rows, range type windows are expressed in terms of values before or after the current ORDER BY column value, which must be a number or date type. For now, only one ORDER BY column is supported by range type windows.
jdbc:hive2://> SELECT name, salary, start_year,
.......> MAX(salary) OVER (PARTITION BY dept_num ORDER BY
.......> start_year RANGE
.......> BETWEEN 2 PRECEDING AND CURRENT ROW) win1
.......> FROM
.......> (
.......> SELECT name, salary, dept_num,
.......> YEAR(start_date) AS start_year
.......> FROM employee_contract
.......> ) a;
+----------+---------+-------------+-------+
|   name   | salary  | start_year  | win1  |
+----------+---------+-------------+-------+
| Lucy     | 5500    | 2010        | 5500  |
| Steven   | 6400    | 2012        | 6400  |
| Will     | 4000    | 2013        | 6400  |
| Will     | 4000    | 2014        | 6400  |
| Michael  | 5000    | 2014        | 6400  |
| Mike     | 6400    | 2013        | 6400  |
| Jess     | 6000    | 2014        | 6400  |
| Lily     | 5000    | 2014        | 6400  |
| Wei      | 7000    | 2010        | 7000  |
| Richard  | 8000    | 2013        | 8000  |
| Yun      | 5500    | 2014        | 8000  |
+----------+---------+-------------+-------+
11 rows selected (92.035 seconds)
Note

If we omit the windowing clause entirely, the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
Sampling

When the data volume is extra large, we may need to find a subset of data to speed up data analysis. Sampling is a technique used to select and analyze a subset of data in order to identify patterns and trends. In Hive, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.
Random sampling uses the RAND() function and the LIMIT keyword to get a sample of data, as shown in the following example. The DISTRIBUTE and SORT keywords are used here to make sure the data is also randomly and efficiently distributed among the mappers and reducers. The ORDER BY RAND() statement can achieve the same purpose, but its performance is not as good:

SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT <N rows to sample>;
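The same shuffle-then-limit idea can be sketched in Python (illustrative code, not part of Hive): give every row a random sort key, order by it, and keep the first N rows:

```python
import random

def random_sample(rows, n, seed=None):
    """Analogue of DISTRIBUTE BY RAND() SORT BY RAND() LIMIT n:
    order the rows by a random key, then keep the first n."""
    rng = random.Random(seed)
    keyed = sorted(rows, key=lambda _: rng.random())  # random sort key per row
    return keyed[:n]

rows = list(range(100))
sample = random_sample(rows, 5, seed=42)
print(sample)  # 5 distinct rows drawn uniformly from the input
```

In Hive, the shuffle step is what spreads the work: DISTRIBUTE BY RAND() sends rows to random reducers, so no single reducer has to sort the whole table, unlike ORDER BY RAND().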
Bucket table sampling is a special sampling method optimized for bucket tables, as shown in the following syntax and example. The colname value specifies the column on which to sample the data. The RAND() function can also be used when sampling on entire rows. If the sample column is also the CLUSTERED BY column, the TABLESAMPLE statement will be more efficient.
-- Syntax
SELECT * FROM <Table_Name>
TABLESAMPLE(BUCKET <specified bucket number to sample> OUT OF
<total number of buckets> ON [colname|RAND()]) table_alias;

-- An example
jdbc:hive2://> SELECT name FROM employee_id_buckets
.......> TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;
+----------+
|   name   |
+----------+
| Lucy     |
| Shelley  |
| Lucy     |
| Lucy     |
| Shelley  |
| Lucy     |
| Will     |
| Shelley  |
| Michael  |
| Will     |
| Will     |
| Will     |
| Will     |
| Will     |
| Lucy     |
+----------+
15 rows selected (0.07 seconds)
Block sampling allows Hive to randomly pick up N rows of data, a percentage (n percent) of the data size, or N bytes of data. The sampling granularity is the HDFS block size. Its syntax and examples are as follows:

-- Syntax
SELECT *
FROM <Table_Name> TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;

-- ByteLengthLiteral
-- (Digit)+ ('b'|'B'|'k'|'K'|'m'|'M'|'g'|'G')
--Sample by rows
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(4 ROWS) a;
+----------+
|name|
+----------+
|Lucy|
|Shelley|
|Lucy|
|Shelley|
+----------+
4 rows selected (0.055 seconds)
--Sample by percentage of data size
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(10 PERCENT) a;
+----------+
|name|
+----------+
|Lucy|
|Shelley|
|Lucy|
+----------+
3 rows selected (0.061 seconds)
--Sample by data size
jdbc:hive2://> SELECT name
.......> FROM employee_id_buckets TABLESAMPLE(3M) a;
+----------+
|name|
+----------+
|Lucy|
|Shelley|
|Lucy|
|Shelley|
|Lucy|
|Shelley|
|Lucy|
|Shelley|
|Lucy|
|Will|
|Shelley|
|Lucy|
|Will|
|Shelley|
|Michael|
|Will|
|Shelley|
|Lucy|
|Will|
|Will|
|Will|
|Will|
|Will|
|Lucy|
|Shelley|
+----------+
25 rows selected (0.07 seconds)
Summary
In this chapter, we covered how to aggregate data using basic aggregation functions. Then, we introduced the advanced aggregations with GROUPING SETS, ROLLUP, and CUBE, as well as aggregation conditions using HAVING. We also covered the various analytic functions and windowing clauses. At the end of the chapter, we introduced three ways of sampling data in Hive. After going through this chapter, you should be able to do basic and advanced aggregations and data sampling in Hive.
In the next chapter, we'll talk about performance considerations in Hive.
Chapter 7. Performance Considerations
Although Hive is built to deal with big data, we still cannot ignore the importance of performance. Most of the time, a better Hive query can rely on the smart query optimizer to find the best execution strategy, as well as on the default-setting best practices from vendor packages. However, as experienced users, we should learn more about the theory and practice of performance tuning in Hive, especially when working on a performance-sensitive project or environment. In this chapter, we will start with the utilities available in Hive to find potential issues causing poor performance. Then, we introduce the best practices of performance considerations in the areas of design, file format, compression, storage, query, and job.
In this chapter, we will cover the following topics:
Performance utilities
Design optimization
Data file optimization
Job and query optimization
Performance utilities
Hive provides the EXPLAIN and ANALYZE statements that can be used as utilities to check and identify the performance of queries.
The EXPLAIN statement
Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command on a query when we have a doubt or a concern about performance. The EXPLAIN command also helps to see the difference between two or more queries written for the same purpose. The syntax for EXPLAIN is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query
The following keywords can be used:
EXTENDED: This provides additional information for the operators in the plan, such as file pathnames and an abstract syntax tree. 
DEPENDENCY: This provides a JSON-format output that contains a list of tables and partitions that the query depends on. It is available since Hive 0.10.0. 
AUTHORIZATION: This lists all entities needed to be authorized, including the input and output, to run the Hive query, and authorization failures, if any. It is available since Hive 0.14.0.
A typical query plan contains the following three sections. We will also have a look at an example later:
Abstract syntax tree (AST): Hive uses a parser generator called ANTLR (see http://www.antlr.org/) to automatically generate a syntax tree for HQL. We can usually ignore this most of the time. 
Stage dependencies: This lists all dependencies and the number of stages used to run the query. 
Stage plans: This contains important information, such as operators and sort orders, for running the job.
The following is what a typical query plan looks like. In the following example, the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE DEPENDENCIES section, both Stage-0 and Stage-1 are independent root stages. In the STAGE PLANS section, Stage-1 has one map and reduce, referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords, as well as expressions and aggregations, are listed. The Stage-0 stage does not have map and reduce. It is just a Fetch operation.
jdbc:hive2://> EXPLAIN SELECT sex_age.sex, count(*)
.......> FROM employee_partitioned
.......> WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;
+----------------------------------------------------------------------------+
|                                  Explain                                   |
+----------------------------------------------------------------------------+
| STAGE DEPENDENCIES:
|   Stage-1 is a root stage
|   Stage-0 is a root stage
|
| STAGE PLANS:
|   Stage: Stage-1
|     Map Reduce
|       Map Operator Tree:
|           TableScan
|             alias: employee_partitioned
|             Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
|             Select Operator
|               expressions: sex_age (type: struct<sex:string,age:int>)
|               outputColumnNames: sex_age
|               Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
|               Group By Operator
|                 aggregations: count()
|                 keys: sex_age.sex (type: string)
|                 mode: hash
|                 outputColumnNames: _col0, _col1
|                 Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
|                 Reduce Output Operator
|                   key expressions: _col0 (type: string)
|                   sort order: +
|                   Map-reduce partition columns: _col0 (type: string)
|                   Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
|                   value expressions: _col1 (type: bigint)
|       Reduce Operator Tree:
|         Group By Operator
|           aggregations: count(VALUE._col0)
|           keys: KEY._col0 (type: string)
|           mode: mergepartial
|           outputColumnNames: _col0, _col1
|           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
|           Select Operator
|             expressions: _col0 (type: string), _col1 (type: bigint)
|             outputColumnNames: _col0, _col1
|             Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
|             Limit
|               Number of rows: 2
|               Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
|               File Output Operator
|                 compressed: false
|                 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
|                 table:
|                     input format: org.apache.hadoop.mapred.TextInputFormat
|                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
|
|   Stage: Stage-0
|     Fetch Operator
|       limit: 2
+----------------------------------------------------------------------------+
53 rows selected (0.26 seconds)
The ANALYZE statement
Hive statistics are a collection of data that describe more details, such as the number of rows, number of files, and raw data size, of the objects in the Hive database. Statistics are metadata about Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), an optimizer that picks the query plan with the lowest cost in terms of the system resources required to complete the query.
The statistics are gathered through the ANALYZE statement, available since Hive 0.10.0, on tables, partitions, and columns, as given in the following examples:
jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;
No rows affected (27.979 seconds)

jdbc:hive2://> ANALYZE TABLE employee_partitioned
.......> PARTITION(year=2014, month=12) COMPUTE STATISTICS;
No rows affected (45.054 seconds)

jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS
.......> FOR COLUMNS employee_id;
No rows affected (41.074 seconds)
Once the statistics are built, we can check them with the DESCRIBE EXTENDED/FORMATTED statement. From the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}. The following is an example:
jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned
.......> PARTITION(year=2014, month=12);

jdbc:hive2://> DESCRIBE EXTENDED employee;
…
parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true,
transient_lastDdlTime=1417726247, numRows=4, totalSize=227,
rawDataSize=223}
jdbc:hive2://> DESCRIBE FORMATTED employee.name;
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
| col_name | data_type | min | max | num_nulls | distinct_count | avg_col_len | max_col_len |
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
| name     | string    |     |     | 0         | 5              | 5.6         | 7           |
+----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
+-----------+------------+-------------------+
| num_trues | num_falses | comment           |
+-----------+------------+-------------------+
|           |            | from deserializer |
+-----------+------------+-------------------+
3 rows selected (0.116 seconds)
Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting:
jdbc:hive2://> SET hive.stats.autogather=true;
Note
Hive logs
Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: the system log and the job log.
The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for the Hive log can be found:
hive.root.logger=WARN,DRFA
hive.log.dir=/tmp/${user.name}
hive.log.file=hive.log
To modify the logging status, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set it from the Hive CLI (applies only to the current user and the current session), as follows:
hive --hiveconf hive.root.logger=DEBUG,console
The job log contains Hive query information and is saved in the same place, /tmp/${user.name}, by default, as one file per Hive user session. We can override the location in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI.
Design optimization
Design optimization covers several data layout and design strategies to improve performance.
Partition tables
Hive partitioning is one of the most effective methods to improve query performance on larger tables. A query with partition filtering will only load data from the specified partitions (subdirectories), so it can execute much faster than a normal query that filters on a non-partitioning field. The selection of the partition key is always an important factor for performance. It should always be a low-cardinality attribute to avoid the overhead of too many subdirectories.
The following are some commonly used dimensions as partition keys:
Partitions by date and time: Use date and time, such as year, month, and day (even hours), as partition keys when the data is associated with the time dimension
Partitions by location: Use country, territory, state, and city as partition keys when the data is location related
Partitions by business logic: Use department, sales region, applications, customers, and so on as partition keys when the data can be separated evenly by some business logic
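As a minimal sketch of the time-dimension case, a table like the employee_partitioned table used elsewhere in this chapter could be defined as follows (the column list is illustrative):

```sql
-- Partitioned by the time dimension; year/month become subdirectories in HDFS
CREATE TABLE employee_partitioned (
  name STRING,
  employee_id INT
)
PARTITIONED BY (year INT, month INT);

-- Partition filtering: only the year=2014/month=12 subdirectory is scanned
SELECT name FROM employee_partitioned
WHERE year = 2014 AND month = 12;
```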
Bucket tables
Similar to partitioning, a bucket table organizes data into separate files in HDFS. Bucketing can speed up data sampling in Hive with sampling on buckets. Bucketing can also improve join performance if the join keys are also bucket keys, because bucketing ensures that the key is present in a certain bucket. More details are given in the Job and query optimization section in this chapter.
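A minimal sketch of a bucket table definition follows; the column list is illustrative, and two buckets are chosen to match the TABLESAMPLE(BUCKET 1 OUT OF 2 ...) example from the sampling section:

```sql
-- Bucket the table on employee_id into 2 files in HDFS
CREATE TABLE employee_id_buckets (
  name STRING,
  employee_id INT
)
CLUSTERED BY (employee_id) INTO 2 BUCKETS;

-- Make sure inserts honor the bucket definition
SET hive.enforce.bucketing = true;
```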
Index
Indexes are very common in RDBMS when we want to speed up access to a column or a set of columns. Hive supports index creation on tables/partitions since Hive 0.7.0. An index in Hive provides a key-based data view and better data access for certain operations, such as WHERE, GROUP BY, and JOIN. Using an index is a cheaper alternative to a full table scan. The command to create an index in Hive is straightforward, as follows:
jdbc:hive2://> CREATE INDEX idx_id_employee_id
.......> ON TABLE employee_id (employee_id)
.......> AS 'COMPACT'
.......> WITH DEFERRED REBUILD;
No rows affected (1.149 seconds)
In addition to the COMPACT keyword (which refers to org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler) used in the preceding example, Hive also supports BITMAP indexes since Hive 0.8.0 for columns with few distinct values, as shown in the following example:
jdbc:hive2://> CREATE INDEX idx_sex_employee_id
.......> ON TABLE employee_id (sex_age)
.......> AS 'BITMAP'
.......> WITH DEFERRED REBUILD;
No rows affected (0.251 seconds)
The WITH DEFERRED REBUILD keyword in the preceding example prevents the index from being built immediately. To build the index, we can issue ALTER…REBUILD commands, as in the following example. When data in the base table changes, the ALTER…REBUILD command must be used to bring the index up to date. This is an atomic operation, so if an index rebuild on a previously indexed table fails, the state of the index remains the same, as shown here:
jdbc:hive2://> ALTER INDEX idx_id_employee_id ON employee_id REBUILD;
No rows affected (111.413 seconds)

jdbc:hive2://> ALTER INDEX idx_sex_employee_id ON employee_id
.......> REBUILD;
No rows affected (82.23 seconds)
Once the index is built, Hive will create a new index table for each index, as follows:
jdbc:hive2://> !table
+-------------+---------------------------------------------+-------------+---------+
| TABLE_SCHEM | TABLE_NAME                                  | TABLE_TYPE  | REMARKS |
+-------------+---------------------------------------------+-------------+---------+
| default     | default__employee_id_idx_id_employee_id__   | INDEX_TABLE | NULL    |
| default     | default__employee_id_idx_sex_employee_id__  | INDEX_TABLE | NULL    |
+-------------+---------------------------------------------+-------------+---------+
The index table follows a naming convention such as default__tablename_indexname__. It contains the indexed column, the _bucketname (a typical file URI on HDFS), and _offsets (offsets for each row). This index table can then be queried, like a regular table, wherever we need the indexed columns, as shown here:
jdbc:hive2://> DESC default__employee_id_idx_id_employee_id__;
+--------------+----------------+----------+
|   col_name   |   data_type    | comment  |
+--------------+----------------+----------+
| employee_id  | int            |          |
| _bucketname  | string         |          |
| _offsets     | array<bigint>  |          |
+--------------+----------------+----------+
3 rows selected (0.135 seconds)
To drop an index, we can use the DROP INDEX index_name ON table_name statement, as follows. However, we cannot drop the index table with a DROP TABLE statement:
jdbc:hive2://> DROP INDEX idx_sex_employee_id ON employee_id;
No rows affected (0.247 seconds)
Note
Since Hive 0.13.0, Hive includes the following new features for performance optimizations:
Tez: Tez (http://tez.apache.org/) is an application framework built on YARN that can execute complex directed acyclic graphs (DAGs) for general data-processing tasks. Tez further splits map and reduce jobs into smaller tasks and combines them in a flexible and efficient way for execution. Tez is considered a flexible and powerful successor to the MapReduce framework. To configure Hive to use Tez instead of the default MapReduce, we need to overwrite the following setting:
SET hive.execution.engine=tez;
Vectorization: Vectorization optimization processes a larger batch of data at the same time rather than one row at a time, thus significantly reducing computing overhead. Each batch consists of a column vector that is usually an array of primitive types. Operations are performed on the entire column vector, which improves the instruction pipelines and cache usage. Files must be stored in the Optimized Row Columnar (ORC) format in order to use vectorization. For more on vectorization, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution. To enable vectorization, we need to do the following setting:
SET hive.vectorized.execution.enabled=true;
Data file optimization
Data file optimization covers performance improvements for the data files in terms of file format, compression, and storage.
File format
Hive supports the TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats. The three ways to specify the file format are as follows:
CREATE TABLE … STORED AS <File_Format>
ALTER TABLE … [PARTITION partition_spec] SET FILEFORMAT <File_Format>
SET hive.default.fileformat=<File_Format> --default file format for tables
Here, <File_Format> is TEXTFILE, SEQUENCEFILE, RCFILE, ORC, or PARQUET.
We can load a text file directly into a table with the TEXTFILE format. To load data into a table with another file format, we need to load the data into a TEXTFILE-format table first. Then, use INSERT OVERWRITE TABLE <target_file_format_table> SELECT * FROM <text_format_source_table> to convert and insert the data into the expected file format.
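The two-step conversion just described can be sketched as follows; the table names, columns, and input path are all illustrative:

```sql
-- Step 1: a TEXTFILE staging table that the raw text can be loaded into
CREATE TABLE employee_text (name STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/employee.txt' INTO TABLE employee_text;

-- Step 2: the target table in the desired format, populated by INSERT OVERWRITE
CREATE TABLE employee_orc (name STRING, salary INT)
STORED AS ORC;

INSERT OVERWRITE TABLE employee_orc SELECT * FROM employee_text;
```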
The file formats supported by Hive and their optimizations are as follows:
TEXTFILE: This is the default file format for Hive. Data is not compressed in a text file. It can be compressed with compression tools, such as GZip, Bzip2, and Snappy. However, these compressed files are not splittable as input during processing. As a result, Hive has to run a single, huge map job to process one big file.
SEQUENCEFILE: This is a binary storage format for key/value pairs. The benefit of a sequence file is that it is more compact than a text file and fits well with the MapReduce output format. Sequence files can be compressed at the record or block level, where the block level has a better compression ratio. To enable block-level compression, we need to do the following settings:
jdbc:hive2://> SET hive.exec.compress.output=true;
jdbc:hive2://> SET io.seqfile.compression.type=BLOCK;
Unfortunately, both text and sequence files, as row-level storage file formats, are not an optimal solution, since Hive has to read a full row even if only one column is being requested. To resolve this problem, hybrid row-columnar storage file formats, such as the RCFILE, ORC, and PARQUET implementations, were created.
RCFILE: This is short for Record Columnar File. It is a flat file consisting of binary key/value pairs that shares much similarity with a sequence file. The RCFile splits data horizontally into row groups. One or several groups are stored in an HDFS file. Then, RCFile saves the row group data in a columnar format by saving the first column across all rows, then the second column across all rows, and so on. This format is splittable and allows Hive to skip irrelevant parts of the data and get the results faster and cheaper.
ORC: This is short for Optimized Row Columnar. It is available since Hive 0.11.0. The ORC format can be considered an improved version of RCFILE. It provides a larger default block size of 256 MB (RCFILE has 4 MB and SEQUENCEFILE has 1 MB), optimized for large sequential reads on HDFS for more throughput and fewer files to reduce overload on the namenode. Different from RCFILE, which relies on the metastore to know data types, the ORC file understands the data types by using specific encoders so that it can optimize compression depending on the different types. It also stores basic statistics, such as MIN, MAX, SUM, and COUNT, on columns, as well as a lightweight index that can be used to skip blocks of rows that do not matter.
PARQUET: This is another row-columnar file format that has a similar design to that of ORC. What's more, Parquet has a wider range of support among the majority of projects in the Hadoop ecosystem, compared to ORC, which only supports Hive and Pig. Parquet leverages the design best practices of Google's Dremel (see http://research.google.com/pubs/pub36632.html) to support nested data structures. Parquet has been supported by a plugin since Hive 0.10.0 and has had native support since Hive 0.13.0.
Considering the maturity of Hive, it is suggested to use the ORC format if Hive is the main tool used in your Hadoop environment. If you use several tools in the Hadoop ecosystem, PARQUET is a better choice in terms of adaptability.
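As a small sketch of putting the recommendation into practice, an ORC table can also carry its own compression choice through the orc.compress table property (ZLIB is the ORC default; the table name and columns here are illustrative):

```sql
-- ORC storage with Snappy compression chosen via a table property
CREATE TABLE employee_orc_snappy (
  name STRING,
  salary INT
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```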
Note
Hadoop Archive File (HAR) is another type of file format, used to pack HDFS files into archives. It is an option (though not a good one) for storing a large number of small files, since storing a large number of small files directly in HDFS is not very efficient. However, HAR still has some limitations that make it unpopular, such as the immutable archive process, not being splittable, and compatibility issues. For more information about HAR and archiving, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Archiving.
Compression
Compression techniques in Hive can significantly reduce the amount of data transferred between mappers and reducers through proper intermediate output compression, as well as the output data size in HDFS through output compression. As a result, the overall Hive query will have better performance. To compress intermediate files produced by Hive between multiple MapReduce jobs, we need to set the following property (false by default) in the Hive CLI or the hive-site.xml file:
jdbc:hive2://> SET hive.exec.compress.intermediate=true;
Then, we need to decide which compression codec to configure. A list of common codecs supported in Hadoop and Hive is as follows:

Compression | Codec                                      | Extension | Splittable
Deflate     | org.apache.hadoop.io.compress.DefaultCodec | .deflate  | N
GZip        | org.apache.hadoop.io.compress.GzipCodec    | .gz       | N
Bzip2       | org.apache.hadoop.io.compress.BZip2Codec   | .bz2      | Y
LZO         | com.hadoop.compression.lzo.LzopCodec       | .lzo      | N
LZ4         | org.apache.hadoop.io.compress.Lz4Codec     | .lz4      | N
Snappy      | org.apache.hadoop.io.compress.SnappyCodec  | .snappy   | N
Hadoop has a default codec (.deflate). The compression ratio for GZip is higher, as is its CPU cost. Bzip2 is splittable, but splitting wasn't supported by Hadoop until version 1.1 (see https://issues.apache.org/jira/browse/HADOOP-4012). In addition, Bzip2 is too slow for compression considering its huge CPU cost. LZO files are not natively splittable, but we can preprocess them (using com.hadoop.compression.lzo.LzoIndexer) to create an index that determines the file splits. When it comes to the balance of CPU cost and compression ratio, LZ4 and Snappy do a better job. Since the majority of codecs do not support splitting after compression, it is suggested to avoid compressing big files in HDFS.
The compression codec can be specified in mapred-site.xml, hive-site.xml, or the Hive CLI, as in the following example:
jdbc:hive2://> SET hive.intermediate.compression.codec=
.......> org.apache.hadoop.io.compress.SnappyCodec;
Intermediate compression will only save disk space for specific jobs that require multiple map and reduce stages. For further saving of disk space, the actual Hive output files can be compressed. When the hive.exec.compress.output property is set to true, Hive will use the codec configured by the mapred.output.compression.codec property to compress the data stored in HDFS, as follows. These properties can be set in hive-site.xml or in the Hive CLI:
jdbc:hive2://> SET hive.exec.compress.output=true;
jdbc:hive2://> SET mapred.output.compression.codec=
.......> org.apache.hadoop.io.compress.SnappyCodec;
Storage optimization
Data that is used or scanned frequently can be identified as hot data. Usually, query performance on hot data is critical for overall performance. Increasing the data replication factor in HDFS for hot data (see the following example) can increase the chance of the data being hit locally by Hive jobs and improve performance. However, this is a trade-off against storage:
$ hdfs dfs -setrep -R -w 4 /user/hive/warehouse/employee
Replication 4 set: /user/hive/warehouse/employee/000000_0
On the other hand, too many files or too much redundancy could exhaust the namenode's memory, especially with lots of files smaller than the HDFS block size. Hadoop itself already has some solutions to deal with the too-many-small-files issue, such as the following:
Hadoop Archive and HAR: These are toolkits to pack small files.
SequenceFile format: This is a format to compress small files into bigger files.
CombineFileInputFormat: A type of InputFormat that combines small files before map and reduce processing. It is the default InputFormat for Hive (see https://issues.apache.org/jira/browse/HIVE-2245).
HDFS federation: This makes namenodes extensible and powerful enough to manage more files.
We can also leverage other tools in the Hadoop ecosystem, if we have them installed, such as the following:
HBase has a smaller block size and a better file format to deal with small-file access issues
Flume NG can be used as a pipe to merge small files into big ones
A scheduled offline file merge program can merge small files in HDFS or before loading them to HDFS
For Hive, we can do the following configurations to merge the files of query results and avoid recreating small files:
hive.merge.mapfiles: This merges small files at the end of a map-only job. By default, it is true.
hive.merge.mapredfiles: This merges small files at the end of a MapReduce job. Set it to true, since its default is false.
hive.merge.size.per.task: This defines the size of merged files at the end of the job. The default value is 256,000,000.
hive.merge.smallfiles.avgsize: This is the threshold for triggering the file merge. The default value is 16,000,000.
When the average output file size of a job is less than the value specified by hive.merge.smallfiles.avgsize, and both hive.merge.mapfiles (for map-only jobs) and hive.merge.mapredfiles (for MapReduce jobs) are set to true, Hive will start an additional MapReduce job to merge the output files into big files.
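The merge properties just described can be set together, for example in a session, as sketched here (the sizes shown are simply the defaults mentioned above):

```sql
-- Enable small-file merging for both map-only and MapReduce query results
SET hive.merge.mapfiles = true;              -- default true
SET hive.merge.mapredfiles = true;           -- default false
SET hive.merge.size.per.task = 256000000;    -- target size of merged files
SET hive.merge.smallfiles.avgsize = 16000000; -- threshold that triggers the merge
```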
Job and query optimization
Job and query optimization covers experience and skills to improve performance in the areas of job-running mode, JVM reuse, parallel job running, and query optimizations for JOIN.
Local mode
Hadoop can run in standalone, pseudo-distributed, and fully distributed modes. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, starting distributed data processing is an overhead, since the launch time of the fully distributed mode takes longer than the job processing time. Since Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings:
jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default false
jdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5;
--default 4
A job must satisfy the following conditions to run in local mode:
The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max
The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
The total number of reduce tasks required is 1 or 0
JVM reuse
By default, Hadoop launches a new JVM for each map or reduce task and runs the map or reduce tasks in parallel. When the map or reduce task is a lightweight job running for only a few seconds, the JVM startup process can be a significant overhead. The MapReduce framework (version 1 only, not YARN) has an option to reuse JVMs by sharing a JVM to run mappers/reducers serially instead of in parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs will always run in separate JVMs. To enable reuse, we can set the maximum number of tasks for a single job that share a JVM using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1:
jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5;
We can also set the value to -1 to indicate that all the tasks for a job will run in the same JVM.
Parallel execution
Hive queries are commonly translated into a number of stages that are executed in a default sequence. These stages are not always dependent on each other. Instead, they can run in parallel to save overall job running time. We can enable this feature with the following settings:
jdbc:hive2://> SET hive.exec.parallel=true; --default false
jdbc:hive2://> SET hive.exec.parallel.thread.number=16;
--default 8, it defines the max number of stages running in parallel
Parallel execution will increase cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance.
Join optimization
We have already discussed optimization for the different types of Hive joins in Chapter 4, Data Selection and Scope. Here, we'll briefly review the key settings for join improvement.
Common join
The common join is also called the reduce side join. It is the basic join in Hive and works most of the time. For common joins, we need to make sure the big table is on the right-most side or is specified by a hint, as follows:
/*+ STREAMTABLE(stream_table_name) */
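The hint can be sketched in a full query as follows; employee_hr is an illustrative table name, and the hint tells Hive to stream table b (the bigger one) through the reducers instead of buffering it:

```sql
-- Hypothetical common join, marking the bigger table b for streaming
SELECT /*+ STREAMTABLE(b) */ a.name, b.start_date
FROM employee a
JOIN employee_hr b ON a.name = b.name;
```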
Map join
Map join is used when one of the join tables is small enough to fit in memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert to a map join automatically with the following settings:
jdbc:hive2://> SET hive.auto.convert.join=true; --default false
jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;
--default 25M
jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true;
--default false. Set to true so that the map join hint is not needed
jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000;
--The default value controls the size of a table that fits in memory
Once auto convert is enabled, Hive will automatically check whether the smaller table's file size is bigger than the value specified by hive.mapjoin.smalltable.filesize; if so, Hive will convert the join to a common join. If the file size is smaller than this threshold, it will try to convert the common join into a map join. Once auto convert join is enabled, there is no need to provide map join hints in the query.
Bucket map join
Bucket map join is a special type of map join applied on bucket tables. To enable bucket map join, we need to enable the following settings:
jdbc:hive2://> SET hive.auto.convert.join=true; --default false
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false
In a bucket map join, all the join tables must be bucket tables and join on the bucket columns. In addition, the number of buckets in the bigger tables must be a multiple of the number of buckets in the smaller tables.
Sort merge bucket (SMB) join
SMB is a join performed on bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB:
jdbc:hive2://> SET hive.input.format=
.......> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
Sort merge bucket map (SMBM) join
SMBM join is a special bucket join that triggers a map-side join only. It can avoid caching all rows in memory as a map join does. To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need the following settings:
jdbc:hive2://> SET hive.auto.convert.join=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
jdbc:hive2://> SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=
org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;
Skew join
When working with data that has a highly uneven distribution, data skew can happen in such a way that a small number of compute nodes must handle the bulk of the computation. The following settings inform Hive to optimize properly if data skew happens:
jdbc:hive2://> SET hive.optimize.skewjoin=true;
--If there is data skew in the join, set it to true. Default is false.
jdbc:hive2://> SET hive.skewjoin.key=100000;
--This is the default value. If the number of rows for a key is bigger
--than this, the new keys will be sent to the other unused reducers.
Note
Skewed data could happen on the GROUP BY data too. To optimize for it, we need the following setting to enable skew-data optimization in the GROUP BY result:
SET hive.groupby.skewindata=true;
Once configured, Hive will first trigger an additional MapReduce job whose map output will be randomly distributed to the reducers to avoid data skew.
For more information about Hive join optimization, please refer to the Apache Hive wiki, available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.
Summary
In this chapter, we first covered how to identify performance bottlenecks using the EXPLAIN and ANALYZE statements. Then, we spoke about design optimization for performance when using tables, partitions, and indexes. We also covered data file optimization, including file format, compression, and storage. At the end of this chapter, we discussed job and query optimization in Hive. After going through this chapter, you should be able to do performance troubleshooting and tuning in Hive.
In the next chapter, we'll talk about function extensions for Hive.
Chapter 8. Extensibility Considerations
Although Hive has many built-in functions, users sometimes need power beyond that provided by built-in functions. For these instances, Hive offers the following three main areas where its functionality can be extended:
User-defined function (UDF): This provides a way to extend functionality with an external function (mainly written in Java) that can be evaluated in HQL
Streaming: This plugs users' own customized mapper and reducer programs into the data streaming
SerDe: This stands for serializers and deserializers and provides a way to serialize or deserialize a custom file format with files stored on HDFS
In this chapter, we'll talk about each of them in more detail.
User-defined functions
Hive defines the following three types of UDF:
UDFs: These are regular user-defined functions that operate row-wise and output one result for one row, such as most built-in mathematical and string functions.
UDAFs: These are user-defined aggregating functions that operate row-wise or group-wise and output one row, or one row for each group, as a result, such as the MAX and COUNT built-in functions.
UDTFs: These are user-defined table-generating functions that also operate row-wise, but they produce multiple rows/tables as a result, such as the EXPLODE function. A UDTF can be used either after SELECT or after the LATERAL VIEW statement.
Note
Since Hive is implemented in Java, UDFs should be written in Java as well. Since Java supports running code in other languages through the javax.script API (see http://docs.oracle.com/javase/6/docs/api/javax/script/package-summary.html), UDFs can be written in languages other than Java. In this book, we only focus on Java UDFs.
We'll start by looking at the Java code template for each kind of function in more detail.
The UDF code template
The code template for a regular UDF is as follows:
package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

// Below are optional, or add more when needed
import org.apache.hadoop.io.Text;
import org.apache.commons.lang.StringUtils;

@Description(
  name = "udf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
@UDFType(deterministic = true, stateful = false)
public class udf_name extends UDF {
  public String evaluate() {
    /*
     * Do something here
     */
    return "return the udf result";
  }

  // overloading is supported
  public String evaluate(<Type_arg1> arg1, ..., <Type_argN> argN) {
    /*
     * Do something here
     */
    return "return the udf result";
  }
}
In the preceding template, the package definition and imports should be self-explanatory. We can import whatever is needed besides the top three mandatory libraries. The @Description annotation is a useful Hive-specific annotation that provides usage information for the UDF in the Hive console. The information defined in the value property will be shown by the HQL DESCRIBE FUNCTION command. The information defined in the extended property will be shown by the HQL DESCRIBE FUNCTION EXTENDED command. The @UDFType annotation tells Hive what behavior to expect from the function. A deterministic UDF (deterministic = true) is a function that always gives the same result when passed the same arguments, such as LENGTH(string input), MAX(), and so on. On the other hand, a non-deterministic UDF (deterministic = false) can return a different result for the same set of arguments, for example, UNIX_TIMESTAMP(), which returns the current timestamp in the default time zone. The stateful (stateful = true) property allows functions to keep some static variables available across rows, such as ROW_NUMBER(), which assigns sequential numbers to all rows in a table.

All UDFs extend the Hive UDF class, so the UDF subclass must implement the evaluate method, which is called by Hive. The evaluate method can be overloaded for different purposes. In this method, we can implement whatever logic and exception handling the function's design calls for, using the Java Hadoop library and the Hadoop data types for MapReduce data serialization, such as Text, DoubleWritable, IntWritable, and so on.
The UDAF code template

In this section, we introduce the UDAF code template built by extending the UDAF class. The code template is as follows:

package com.packtpub.hive.essentials.hiveudaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

@Description(
  name = "udaf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
@UDFType(deterministic = false, stateful = true)
public final class udaf_name extends UDAF {

  /**
   * The internal state of an aggregation function.
   *
   * Note that this is only needed if the internal state
   * cannot be represented by a primitive.
   *
   * The internal state can contain fields with types like
   * ArrayList<String> and HashMap<String, Double> if needed.
   */
  public static class UDAFState {
    private <Type_state1> state1;
    private <Type_stateN> stateN;
  }

  /**
   * The actual class for doing the aggregation. Hive will
   * automatically look for all internal classes of the UDAF
   * that implement UDAFEvaluator.
   */
  public static class UDAFExampleAvgEvaluator implements UDAFEvaluator {

    UDAFState state;

    public UDAFExampleAvgEvaluator() {
      super();
      state = new UDAFState();
      init();
    }

    /**
     * Reset the state of the aggregation.
     */
    public void init() {
      /*
       * Examples for initializing state.
       */
      state.state1 = 0;
      state.stateN = 0;
    }

    /**
     * Iterate through one row of original data.
     *
     * The number and type of arguments need to be the same as we
     * call this UDAF from the Hive command line.
     *
     * This function should always return true.
     */
    public boolean iterate(<Type_arg1> arg1, ..., <Type_argN> argN) {
      /*
       * Add logic here for how to do aggregation if there is
       * a new value to be aggregated.
       */
      return true;
    }

    /**
     * Called on the mapper side on different data nodes.
     * Terminate a partial aggregation and return the state.
     * If the state is a primitive, just return primitive Java
     * classes like Integer or String.
     */
    public UDAFState terminatePartial() {
      /*
       * Check and return a partial result as expected.
       */
      return state;
    }

    /**
     * Merge with a partial aggregation.
     *
     * This function should always have a single argument,
     * which has the same type as the return value of
     * terminatePartial().
     */
    public boolean merge(UDAFState o) {
      /*
       * Define how to merge the results calculated
       * from all data nodes.
       */
      return true;
    }

    /**
     * Terminates the aggregation and returns the final result.
     */
    public long terminate() {
      /*
       * Check and return the final result as expected.
       */
      return state.stateN;
    }
  }
}
A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF containing one or more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator. Make sure that the inner class that implements UDAFEvaluator is defined as public. Otherwise, Hive won't be able to use reflection to determine the UDAFEvaluator implementation. We should also implement the five required functions (init, iterate, terminatePartial, merge, and terminate) already described in the code comments.

Note
Both UDF and UDAF can also be implemented by extending the GenericUDF and GenericUDAFEvaluator classes to avoid using Java reflection for better performance. These generic functions are actually what Hive's built-in UDF implementations extend internally. Generic functions support complex data types, such as MAP, ARRAY, and STRUCT, as arguments, but the UDF and UDAF classes do not. For more information about GenericUDAF, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy.
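To make the evaluator lifecycle concrete, here is a minimal plain-Python sketch (an illustration only, not Hive code) of how the init, iterate, terminatePartial, merge, and terminate calls cooperate to compute an average across two simulated mappers and one reducer:

```python
# A plain-Python sketch of the UDAF evaluator lifecycle for AVG.
# This illustrates the call protocol only; it is not runnable Hive code.

class AvgEvaluator:
    def __init__(self):
        self.init()

    def init(self):
        # Reset the aggregation state (running sum and row count).
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        # Called once per input row on the mapper side.
        if value is not None:
            self.total += value
            self.count += 1
        return True

    def terminate_partial(self):
        # Mapper side: emit the partial state for shipping to the reducer.
        return (self.total, self.count)

    def merge(self, partial):
        # Reducer side: fold in another node's partial state.
        self.total += partial[0]
        self.count += partial[1]
        return True

    def terminate(self):
        # Produce the final aggregated result.
        return self.total / self.count if self.count else None

# Simulate two mappers and one reducer.
m1, m2 = AvgEvaluator(), AvgEvaluator()
for v in [1, 2, 3]:
    m1.iterate(v)
for v in [4, 5]:
    m2.iterate(v)

reducer = AvgEvaluator()
reducer.merge(m1.terminate_partial())
reducer.merge(m2.terminate_partial())
print(reducer.terminate())  # 3.0
```

The same division of labor applies in the Java template above: iterate runs on the mappers, merge folds the partial states on the reducers, and terminate produces the final value.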
The UDTF code template

To implement a UDTF, there is only one way: extending org.apache.hadoop.hive.ql.exec.GenericUDTF. There is no plain UDTF class. We need to implement three methods: initialize, process, and close. The UDTF will call the initialize method, which returns information about the function output, such as the data types, the number of outputs, and so on. Then, the process method is called to apply the core function logic to the arguments and forward the result. At the end, the close method will do proper cleanup, if needed. The code template for a UDTF is as follows:

package com.packtpub.hive.essentials.hiveudtf;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

@Description(
  name = "udtf_name",
  value = "_FUNC_(arg1, arg2, ... argN) - A short description for the function",
  extended = "This is more detail about the function, such as syntax, examples."
)
public class udtf_name extends GenericUDTF {

  private PrimitiveObjectInspector stringOI = null;

  /**
   * This method will be called exactly once per instance.
   * It performs any custom initialization logic we need.
   * It is also responsible for verifying the input types and
   * specifying the output types.
   */
  @Override
  public StructObjectInspector initialize(ObjectInspector[] args)
      throws UDFArgumentException {

    // Check the number of arguments.
    if (args.length != 1) {
      throw new UDFArgumentException(
          "The UDTF should take exactly one argument");
    }

    /*
     * Check that the input ObjectInspector[] array contains a
     * single PrimitiveObjectInspector of the primitive type,
     * such as String.
     */
    if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
        || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() !=
        PrimitiveObjectInspector.PrimitiveCategory.STRING) {
      throw new UDFArgumentException(
          "The UDTF should take a string as a parameter");
    }

    stringOI = (PrimitiveObjectInspector) args[0];

    /*
     * Define the expected output for this function, including
     * each alias and the types for the aliases.
     */
    List<String> fieldNames = new ArrayList<String>(2);
    List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
    fieldNames.add("alias1");
    fieldNames.add("alias2");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);

    // Set up the output schema.
    return ObjectInspectorFactory.getStandardStructObjectInspector(
        fieldNames, fieldOIs);
  }

  /**
   * This method is called once per input row and generates
   * output. The "forward" method is used (instead of
   * "return") in order to specify the output from the function.
   */
  @Override
  public void process(Object[] record) throws HiveException {
    /*
     * We may need to convert the object to a primitive type
     * before implementing customized logic.
     */
    final String recStr = (String)
        stringOI.getPrimitiveJavaObject(record[0]);
    // Emit newly created structs after applying customized logic.
    forward(new Object[] {recStr, Integer.valueOf(1)});
  }

  /**
   * This method is for any cleanup that is necessary before
   * returning from the UDTF. Since the output stream has
   * already been closed at this point, this method cannot
   * emit more rows.
   */
  @Override
  public void close() throws HiveException {
    // Do nothing.
  }
}
Development and deployment

We'll go through the whole development and deployment process using an example. Let's create a Hive function called toUpper, which converts a string to uppercase, using the following steps:

1. Download and install a Java IDE, such as Eclipse, from http://www.eclipse.org/downloads/packages/eclipse-ide-java-developers/lunasr1.
2. Start the IDE and create a Java project.
3. Right-click on the project and choose the Build Path | Configure Build Path | Add External Jars option. It will open a new window. Navigate to the directory containing the Hive and Hadoop libraries. Then, select and add all the JAR files that need to be imported. We can also resolve library dependencies automatically by using Maven (see http://maven.apache.org/) and a proper pom.xml file. How to configure a library repository in pom.xml files is usually well described in the Hadoop vendor package or in the Apache Hive and Hadoop help documents.
4. In the IDE, create the ToUpper.java file as follows, according to the UDF template mentioned previously:

package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class ToUpper extends UDF {
  public Text evaluate(Text input) {
    if (input == null) return null;
    return new Text(input.toString().toUpperCase());
  }
}

5. Now, export this project as a JAR file (or build it with Maven) named toupper.jar.
6. Copy this JAR file to a directory, such as /home/dayongd/hive/lib/, on a node of the Hive cluster.
7. Add the JAR to the Hive environment using one of the following options (option 3 or 4 is recommended):

Option 1: Run ADD JAR /home/dayongd/hive/lib/toupper.jar in the Hive CLI. This is only valid for the current session, and does not work for ODBC connections.
Option 2: Add ADD JAR /home/dayongd/hive/lib/toupper.jar to /home/$USER/.hiverc (we can create the file if it is not there). In this case, the file needs to be deployed to every node from where we might launch the Hive shell. This is also only valid for the current session, and does not work for ODBC connections.
Option 3: Add the following configuration to the hive-site.xml file:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///home/dayongd/hive/lib/toupper.jar</value>
</property>

Option 4: Copy the JAR file to the /${HIVE_HOME}/auxlib/ folder (create it if it does not exist).

8. Create the function. We can create a temporary function that is only valid in the current Hive session as follows:

CREATE TEMPORARY FUNCTION toUpper AS
'com.packtpub.hive.essentials.hiveudf.ToUpper';

Note
Since Hive 0.13.0, we can use one command to add the JAR and create a permanent function, which is registered to the metastore and can be referenced in a query without creating a temporary function in each session:

CREATE FUNCTION toUpper AS
'com.packtpub.hive.essentials.hiveudf.ToUpper' USING JAR
'hdfs:///path/to/jar';

9. Verify the function:

SHOW FUNCTIONS ToUpper;
DESCRIBE FUNCTION ToUpper;
DESCRIBE FUNCTION EXTENDED ToUpper;

10. Use the UDF in HQL:

SELECT toUpper(name) FROM employee LIMIT 1000;

11. Drop the function when needed:

DROP TEMPORARY FUNCTION IF EXISTS toUpper;
Streaming

Hive can also leverage the streaming feature in Hadoop to transform data in an alternative way. The streaming API opens an I/O pipe to an external process (script). Then, the process reads data from the standard input and writes the results out through the standard output. In Hive, we can use the TRANSFORM clause in HQL directly and embed mapper and reducer scripts written as commands, shell scripts, Java, or other programming languages. Although streaming brings overhead by using serialization/deserialization between processes, it is a simpler coding mode for developers, especially non-Java developers. The syntax of the TRANSFORM clause is as follows:

FROM (
  FROM src
  SELECT TRANSFORM '(' expression (',' expression)* ')'
  (inRowFormat)?
  USING 'map_user_script'
  (AS colName (',' colName)*)?
  (outRowFormat)? (outRecordReader)?
  (CLUSTER BY? | DISTRIBUTE BY? SORT BY?) src_alias
)
SELECT TRANSFORM '(' expression (',' expression)* ')'
(inRowFormat)?
USING 'reduce_user_script'
(AS colName (',' colName)*)?
(outRowFormat)? (outRecordReader)?
By default, the INPUT values for the user script are the following:

Columns transformed to STRING values
Delimited by a tab
NULL values converted to the literal string \N (which differentiates NULL values from empty strings)

By default, the OUTPUT values of the user script are the following:

Treated as tab-separated STRING columns
\N will be reinterpreted as NULL
The resulting STRING column will be cast to the data type specified in the table declaration
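As an illustration of these defaults (plain Python, not part of Hive), a hypothetical row codec for a streaming script could look like this:

```python
# A sketch (not Hive code) of the default TRANSFORM I/O conventions:
# tab-delimited STRING columns, with NULL encoded as the literal \N.

def decode_row(line):
    # Parse one tab-delimited input line; the literal \N becomes None.
    return [None if col == r"\N" else col
            for col in line.rstrip("\n").split("\t")]

def encode_row(cols):
    # Serialize one output row; None is written back as \N.
    return "\t".join(r"\N" if col is None else str(col) for col in cols)

row = decode_row("Steven\t\\N\n")
print(row)  # ['Steven', None]
```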
These defaults can be overridden with ROW FORMAT. An example of Hive streaming using the Python script upper.py is as follows:

#!/usr/bin/env python
'''
This is a script to uppercase its input
'''
import sys

def main():
    try:
        for line in sys.stdin:
            n = line.strip()
            print(n.upper())
    except:
        return None

if __name__ == "__main__":
    main()
Test the script as follows:

$ echo "Will" | python upper.py
WILL

Call the script in the Hive CLI from HQL:

jdbc:hive2://> ADD FILE /home/dayongd/Downloads/upper.py;
jdbc:hive2://> SELECT TRANSFORM (name, work_place[0])
.......> USING 'python upper.py' AS (CAP_NAME, CAP_PLACE)
.......> FROM employee;
+-----------+------------+
| cap_name  | cap_place  |
+-----------+------------+
| MICHAEL   | MONTREAL   |
| WILL      | MONTREAL   |
| SHELLEY   | NEW YORK   |
| LUCY      | VANCOUVER  |
| STEVEN    | NULL       |
+-----------+------------+
5 rows selected (30.101 seconds)

Note
The TRANSFORM command has not been allowed when SQL standard-based authorization is configured, since Hive 0.13.0.
SerDe

SerDe stands for Serializer and Deserializer. It is the technology that Hive uses to process records and map them to column data types in Hive tables. To explain the scenario of using SerDe, we need to understand how Hive reads and writes data.

The process to read data is as follows:

1. Data is read from HDFS.
2. Data is processed by the INPUTFORMAT implementation, which defines the input data splits and key/value records. In Hive, we can use CREATE TABLE … STORED AS <FILE_FORMAT> (see Chapter 7, Performance Considerations, for the available file formats) to specify which INPUTFORMAT it reads from.
3. The Java Deserializer class defined in SerDe is called to format the data into a record that maps to the columns and data types in a table.

For an example of reading data, we can use JSONSerDe to read TEXTFILE format data from HDFS and translate each row of JSON attributes and values into rows in Hive tables with the correct schema.

The process to write data is as follows:

1. Data to be written (such as through an INSERT statement) is translated by the Serializer class defined in SerDe into the format that the OUTPUTFORMAT class can read.
2. Data is processed by the OUTPUTFORMAT implementation, which creates the RecordWriter object. Similar to the INPUTFORMAT implementation, the OUTPUTFORMAT implementation is specified in the same way for the table to which it writes the data.
3. The data is written to the table (data saved in HDFS).

For an example of writing data, we can write a row-column of data to Hive tables using JSONSerDe, which translates the data to a JSON text string saved to HDFS.
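The two directions can be sketched in plain Python using the JSON example above (this is an analogy, not Hive's actual SerDe API; the column names are hypothetical):

```python
import json

# An analogy to the JSONSerDe example above (not Hive's SerDe API):
# each line of a TEXTFILE is one record; the "deserializer" maps the raw
# text to column values, and the "serializer" does the reverse.

COLUMNS = ["name", "age"]  # hypothetical table schema

def deserialize(raw_line):
    # Read path: raw HDFS record -> row of column values.
    obj = json.loads(raw_line)
    return [obj.get(col) for col in COLUMNS]

def serialize(row):
    # Write path: row of column values -> raw record for the OUTPUTFORMAT.
    return json.dumps(dict(zip(COLUMNS, row)))

row = deserialize('{"name": "Will", "age": 28}')
print(row)  # ['Will', 28]
```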
Recent Hive versions use the org.apache.hadoop.hive.serde2 library; org.apache.hadoop.hive.serde is the deprecated library. A list of commonly used SerDes in Hive is as follows:

LazySimpleSerDe: The default built-in SerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) that is used with the TEXTFILE format. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_lz
.......> STORED AS TEXTFILE AS
.......> SELECT name from employee;
No rows affected (32.665 seconds)
ColumnarSerDe: This is the built-in SerDe used with the RCFILE format. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_cs
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
.......> STORED AS RCFile AS
.......> SELECT name from employee;
No rows affected (27.187 seconds)
RegexSerDe: This is the built-in Java regular expression SerDe used to parse text files. It can be used as follows:

-- Parse comma-separated fields
jdbc:hive2://> CREATE TABLE test_serde_rex(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
.......> WITH SERDEPROPERTIES(
.......> 'input.regex'='([^,]*),([^,]*),([^,]*)',
.......> 'output.format.string'='%1$s %2$s %3$s'
.......> )
.......> STORED AS TEXTFILE;
No rows affected (0.266 seconds)
HBaseSerDe: This is the built-in SerDe that enables Hive to integrate with HBase. We can store Hive tables in HBase by leveraging this SerDe. Make sure to have HBase installed before running the following query:

jdbc:hive2://> CREATE TABLE test_serde_hb(
.......> id string,
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.hbase.HBaseSerDe'
.......> STORED BY
.......> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
.......> WITH SERDEPROPERTIES(
.......> "hbase.columns.mapping"=
.......> ":key,info:name,info:sex,info:age"
.......> )
.......> TBLPROPERTIES("hbase.table.name"="test_serde");
No rows affected (0.387 seconds)
AvroSerDe: This is the built-in SerDe that enables reading and writing Avro (see http://avro.apache.org/) data in Hive tables. Avro is a remote procedure call and data serialization framework. Since Hive 0.14.0, Avro-backed tables can simply be created by using the CREATE TABLE … STORED AS AVRO statement, as follows:

jdbc:hive2://> CREATE TABLE test_serde_avro(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
.......> STORED AS INPUTFORMAT
.......> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
.......> OUTPUTFORMAT
.......> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
.......> ;
No rows affected (0.31 seconds)
ParquetHiveSerDe: This is the built-in SerDe (parquet.hive.serde.ParquetHiveSerDe) that enables reading and writing the Parquet data format since Hive 0.13.0. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_parquet
.......> STORED AS PARQUET AS
.......> SELECT name from employee;
No rows affected (34.079 seconds)
OpenCSVSerDe: This is the SerDe for reading and writing CSV data. It has been a built-in SerDe since Hive 0.14.0. We can also install the implementation from other open source libraries, such as https://github.com/ogrodnek/csv-serde. It can be used as follows:

jdbc:hive2://> CREATE TABLE test_serde_csv(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE
.......> 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
.......> STORED AS TEXTFILE;
JSONSerDe: This is a third-party SerDe for reading and writing JSON data records with Hive. Make sure to install it (from https://github.com/rcongiu/Hive-JSON-Serde) before running the following query:

jdbc:hive2://> CREATE TABLE test_serde_js(
.......> name string,
.......> sex string,
.......> age string
.......> )
.......> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
.......> STORED AS TEXTFILE;
No rows affected (0.245 seconds)
Hive also allows users to define a custom SerDe if none of these works for their data format. For more information about custom SerDes, please refer to the Apache wiki at https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HowtoWriteYourOwnSerDe.
Summary

In this chapter, we introduced the three main areas in which Hive's functionality can be extended. We also covered the three kinds of user-defined functions in Hive, as well as the coding templates and deployment steps to guide your coding and deployment practice. Then, we talked about streaming in Hive as a way to plug in your own code, which does not have to be Java code. At the end of this chapter, we discussed the SerDes available in Hive to parse different formats of data files when reading or writing data. After going through this chapter, we should be able to write basic UDFs, plug code into streaming, and use the available SerDes in Hive.

In the next chapter, we'll talk about security considerations for Hive.
Chapter 9. Security Considerations

In most open source software, security is one of the most important areas, but it is always addressed at a later stage. As the main SQL-like interface for data in Hadoop, Hive must ensure that data is securely protected and accessed. For this reason, security in Hive is now considered an integral and important part of the Hadoop ecosystem. The earlier versions of Hive mainly relied on HDFS for security. The security of Hive gradually matured after HiveServer2 was released as an important milestone of the Hive server.

This chapter will discuss Hive security in the following areas:

Authentication
Authorization
Encryption
Authentication

Authentication is the process of verifying the identity of a user by obtaining the user's credentials. Hive has offered authentication since HiveServer2. With the previous HiveServer, if we could access the host/port over the network, we could access the data. In that case, the Hive Metastore server can be used to authenticate thrift clients using Kerberos. As mentioned in Chapter 2, Setting Up the Hive Environment, it is strongly recommended to upgrade the Hive server to HiveServer2 in terms of security and reliability. In this section, we will briefly talk about authentication configurations in both the Metastore server and HiveServer2.

Note
Kerberos

Kerberos is a network authentication protocol developed by MIT as part of Project Athena. It uses time-sensitive tickets that are generated using symmetric key cryptography to securely authenticate a user in an unsecured network environment. The name Kerberos is derived from Greek mythology, where Kerberos was the three-headed dog that guarded the gates of Hades. The three-headed part refers to the three parties involved in the Kerberos authentication process: the client, the server, and the Key Distribution Center (KDC). All clients and servers registered to a KDC are known as a realm, which is typically the domain's DNS name in all caps. For more information, please refer to the MIT Kerberos website at http://web.mit.edu/kerberos/.
Metastore server authentication

To force clients to authenticate with the Hive Metastore server using Kerberos, we can set the following properties in the hive-site.xml file:

Enable the Simple Authentication and Security Layer (SASL) framework to enforce client Kerberos authentication, as follows:

<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
  <description>If true, the metastore thrift interface will be secured
  with the SASL framework. Clients must authenticate with Kerberos.
  </description>
</property>

Specify the Kerberos keytab that is generated. Override the following example if we want to keep the file in another place. Make sure the file access permissions are set to 400, implying read permission for the owner only, to avoid the owner's identity being stolen by others:

<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>/etc/hive/conf/hive.keytab</value>
  <description>The sample path to the Kerberos keytab file containing
  the metastore thrift server's service principal.</description>
</property>

Specify the Kerberos principal pattern string. The special string _HOST will be replaced automatically with the correct hostname. The YOUR-REALM.COM value should be replaced by the actual realm name:

<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
  <description>The service principal for the metastore thrift server.
  </description>
</property>
HiveServer2 authentication

HiveServer2 supports the following authentication modes. To configure HiveServer2 to use one of these modes, we can set the proper properties in hive-site.xml as follows:

None authentication: None authentication is the default setting. "None" here means Hive allows anonymous access, as shown in the following setting:

<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
</property>

Kerberos authentication: If Kerberos authentication is used, authentication is supported between the thrift client and HiveServer2, and between HiveServer2 and secure HDFS. To enable Kerberos authentication for HiveServer2, we can set the following properties, overriding the keytab path (if we want to keep the file in another place) as well as changing YOUR-REALM.COM to the actual realm name:

<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
</property>

Once Kerberos is enabled, the JDBC client (such as Beeline) must include the principal parameter in the JDBC connection string, such as the following:

jdbc:hive2://HiveServer2HostName:10000/default;principal=hive/HiveServe
LDAP authentication: To configure HiveServer2 to use user and password validation backed by LDAP (see http://tools.ietf.org/html/rfc4511), we can set the following properties:

<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>LDAP_URL, such as ldap://[email protected]</value>
</property>
<property>
  <name>hive.server2.authentication.ldap.Domain</name>
  <value>Your Domain Name</value>
</property>

To configure it with OpenLDAP, we can add the baseDN setting instead of the Domain property, as follows:

<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>LDAP_BaseDN, such as ou=people,dc=packtpub,dc=com</value>
</property>
Pluggable custom authentication: This provides a custom authentication provider for HiveServer2. To enable it, configure the settings as follows:

<property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
</property>
<property>
  <name>hive.server2.custom.authentication.class</name>
  <value>pluggable-auth-class-name</value>
  <description>Custom authentication class name, such as
  com.packtpub.hive.essentials.hiveudf.customAuthenticator
  </description>
</property>

Note
Pluggable authentication with a customized class did not work until the bug (see https://issues.apache.org/jira/browse/HIVE-4778) was fixed in Hive 0.13.0.
The following is a sample of a customized class that implements the org.apache.hive.service.auth.PasswdAuthenticationProvider interface. The overridden Authenticate method has the core logic of how to authenticate a username and password. Make sure to copy the compiled JAR file to $HIVE_HOME/lib/ so that the preceding settings can work.

customAuthenticator.java

package com.packtpub.hive.essentials.hiveudf;

import java.util.Hashtable;
import javax.security.sasl.AuthenticationException;
import org.apache.hive.service.auth.PasswdAuthenticationProvider;

/*
 * The customized class for HiveServer2 authentication
 */
public class customAuthenticator implements PasswdAuthenticationProvider {

  Hashtable<String, String> authHashTable = null;

  public customAuthenticator() {
    authHashTable = new Hashtable<String, String>();
    authHashTable.put("user1", "passwd1");
    authHashTable.put("user2", "passwd2");
  }

  @Override
  public void Authenticate(String user, String password)
      throws AuthenticationException {
    String storedPasswd = authHashTable.get(user);
    if (storedPasswd != null && storedPasswd.equals(password))
      return;
    throw new AuthenticationException(
        "customAuthenticatorException: Invalid user");
  }
}
Pluggable Authentication Modules (PAM) authentication: Since Hive 0.13.0, Hive supports PAM authentication, which provides the benefit of plugging existing authentication mechanisms into Hive. Configure the following settings to enable PAM authentication. For more information about how to install PAM, please refer to the Setting Up HiveServer2 article in the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PluggableAuthenticationModules(PAM).

<property>
  <name>hive.server2.authentication</name>
  <value>PAM</value>
</property>
<property>
  <name>hive.server2.authentication.pam.services</name>
  <value>pluggable-auth-class-name</value>
  <description>Set this to a list of comma-separated PAM services that
  will be used. Note that a file with the same name as the PAM service
  must exist in /etc/pam.d.</description>
</property>
Authorization

Authorization in Hive is used to verify whether a user has permission to perform a certain action, such as creating, reading, or writing data or metadata. Hive provides three authorization modes: legacy mode, storage-based mode, and SQL standard-based mode.
Legacy mode

This is the default authorization mode in Hive, providing column- and row-level authorization through HQL statements. However, it is not a completely secure authorization mode and has a couple of limitations. It can mainly be used to prevent good users from accidentally doing bad things rather than to prevent malicious operations. In order to enable the legacy authorization mode, we need to set the following properties in hive-site.xml:

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enables or disables the hive client authorization
  </description>
</property>
<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>ALL</value>
  <description>The privileges automatically granted to the owner whenever
  a table gets created. An example like "select,drop" will grant select
  and drop privileges to the owner of the table.
  </description>
</property>

Since this is not a secure authorization mode, we will not discuss more details here. For more on HQL support in the legacy authorization mode, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Hive+Default+Authorization+-+Legacy+Mode.
Storage-based mode

The storage-based authorization mode (available since Hive 0.10.0) relies on the authorization provided by the storage layer, HDFS, which provides both POSIX and ACL permissions (the latter available since Hive 0.14.0; refer to https://issues.apache.org/jira/browse/HIVE-7583). Storage-based authorization is enabled in the Hive Metastore server, which has a single consistent view of metadata across other applications in the ecosystem. This mode checks Hive user permissions against the POSIX permissions on the corresponding file directories in HDFS. In addition to the POSIX permissions model, HDFS also provides access control lists, described in ACLs on HDFS at http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#ACLs_Access_Control_Lists. Considering its implementation, the storage-based authorization mode only offers authorization at the level of Hive databases, tables, and partitions rather than at the column and row level. With its dependency on HDFS permissions, it lacks the flexibility to manage authorization through HQL statements.

To enable the storage-based authorization mode, we can set the following properties in the hive-site.xml file:

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
  <description>enable or disable the hive client authorization
  </description>
</property>
<property>
  <name>hive.security.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  <description>The class name of the Hive client authorization manager.
  </description>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
  <description>Allows Hive queries to be run by the user who submits the
  query rather than the hive user.</description>
</property>
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
  <description>This turns on metastore-side security.</description>
</property>
<property>
  <name>hive.security.metastore.authorization.manager</name>
  <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  <description>The authorization manager class name to be used in the
  metastore for authorization.</description>
</property>

Note
Since Hive 0.14.0, storage-based authorization also authorizes read privileges on databases and tables by default through the hive.security.metastore.authorization.auth.reads property. For more information, please refer to https://issues.apache.org/jira/browse/HIVE-8221.
SQLstandard-basedmodeForfine-grainedaccesscontrolonacolumnandrowlevel,wecanuseSQLstandard-basedmodeavailablesinceHive0.13.0.ItissimilartotheSQLauthorizationbyusingtheGRANTandREVOKEstatementstocontrolaccessthroughtheHiveServer2configuration.However,toolssuchasHiveCLIandHadoop/HDFS/MapReducecommandsdonotaccessdatathroughHiveServer2,soSQLstandard-basedmodecannotauthorizetheiraccess.Therefore,itisrecommendedtousestorage-basedmodetogetherwithSQLstandard-basedmodeauthorizationtoauthorizeuserswhodonotaccessfromHiveServer2.
ToenableSQLstandard-basedmodeauthorization,wecansetthefollowingpropertiesinthehive-site.xmlfile:
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
<description>Allows Hive queries to be run by the user who submits the
query rather than the hive user. This needs to be turned off for SQL
standard-based mode.</description>
</property>
<property>
<name>hive.users.in.admin.role</name>
<value>dayongd,administrator</value>
<description>A comma-separated list of users assigned to the ADMIN role.
</description>
</property>
<property>
<name>hive.security.authorization.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.security.authorization.manager</name>
<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
<property>
<name>hive.security.authenticator.manager</name>
<value>org.apache.hadoop.hive.ql.security.SessionStateUserAuthenticator</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>""</value>
<description>""(quotationmarkssurroundingasingleemptyspace).
</description>
</property>
The users in the configured admin role must run the following command to make the admin role effective, and then restart HiveServer2:
jdbc:hive2://> GRANT admin TO USER dayongd;
The basic syntax to grant or revoke an authorization role or privilege is as follows:
GRANT <ROLENAME> TO <USERS> [WITH ADMIN OPTION];
REVOKE [ADMIN OPTION FOR] <ROLENAME> FROM <USERS>;
Here, the following parameters are used:
<ROLENAME>: This can be a comma-separated list of role names
<USERS>: This can be a user or a role
WITH ADMIN OPTION: This makes sure that the user gets privileges to grant the role to other users/roles
Another example to grant or revoke a privilege is as follows:
GRANT <PRIVILEGE> ON <OBJECT> TO <USERS>;
REVOKE <PRIVILEGE> ON <OBJECT> FROM <USERS>;
Here, the following parameters are used:
<PRIVILEGE>: This can be INSERT, SELECT, UPDATE, DELETE, or ALL
<USERS>: This can be a user or a role
<OBJECT>: This is a table or a view
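As a concrete illustration (the user name below is hypothetical, and the employee table is the sample table used throughout this book), granting and revoking SELECT on a table might look like this:

```sql
-- Grant SELECT on the employee table to user1
GRANT SELECT ON TABLE employee TO USER user1;

-- Confirm which privileges user1 holds on the table
SHOW GRANT USER user1 ON TABLE employee;

-- Take the privilege away again
REVOKE SELECT ON TABLE employee FROM USER user1;
```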
For more examples of HQL statements to manage SQL standard-based authorization, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization#SQLStandardBasedHiveAuthorization-Configuration.
Note
Sentry
Sentry is a highly modular system for providing centralized, fine-grained, role-based authorization to both data and metadata stored on an Apache Hadoop cluster. It can be integrated with Hive to deliver advanced authorization controls. For more information about Sentry, please refer to http://incubator.apache.org/projects/sentry.html.
Encryption
For sensitive and legally protected data, such as personal identity information (PII), it is required to store the data in an encrypted format in the filesystem. However, Hive does not natively support encryption and decryption yet (see https://issues.apache.org/jira/browse/HIVE-5207).
Alternatively, we can look for third-party tools to encrypt and decrypt data after exporting it from Hive, but this requires additional postprocessing. The new HDFS encryption (see https://issues.apache.org/jira/browse/HDFS-6134) offers transparent encryption and decryption of data on HDFS. It will satisfy our request if we want to encrypt the whole dataset in HDFS. However, it cannot be applied at the selected column and row level in a Hive table, where the encrypted PII is often only a part of the raw data. In this case, the best solution for now is to use Hive UDFs to plug encryption and decryption implementations into selected columns or partial data in the Hive tables.
Sample UDF implementations for encryption and decryption using the AES encryption algorithm are as follows:
AESEncrypt.java: The implementation is as follows:
package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

@Description(
  name = "aesencrypt",
  value = "_FUNC_(str) - Returns encrypted string based on AES key.",
  extended = "Example:\n" +
    " > SELECT aesencrypt(pii_info) FROM table_name;\n"
)
@UDFType(deterministic = true, stateful = false)
/*
 * A Hive encryption UDF
 */
public class AESEncrypt extends UDF {
  public String evaluate(String unencrypted) {
    String encrypted = "";
    if (unencrypted != null) {
      try {
        encrypted = CipherUtils.encrypt(unencrypted);
      } catch (Exception e) {}
    }
    return encrypted;
  }
}
AESDecrypt.java: This can be implemented as follows:
package com.packtpub.hive.essentials.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.UDFType;

@Description(
  name = "aesdecrypt",
  value = "_FUNC_(str) - Returns unencrypted string based on AES key.",
  extended = "Example:\n" +
    " > SELECT aesdecrypt(pii_info) FROM table_name;\n"
)
@UDFType(deterministic = true, stateful = false)
/*
 * A Hive decryption UDF
 */
public class AESDecrypt extends UDF {
  public String evaluate(String encrypted) {
    // Check for null before touching the input to avoid a NullPointerException
    String unencrypted = null;
    if (encrypted != null) {
      unencrypted = encrypted;
      try {
        unencrypted = CipherUtils.decrypt(encrypted);
      } catch (Exception e) {}
    }
    return unencrypted;
  }
}
CipherUtils.java: This can be implemented as follows:
package com.packtpub.hive.essentials.hiveudf;

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;

/*
 * The core encryption and decryption logic
 */
public class CipherUtils
{
  // This is a secret key in terms of ASCII
  private static byte[] key = {
    0x75, 0x69, 0x69, 0x73, 0x40, 0x73, 0x41, 0x53, 0x65, 0x65,
    0x72, 0x69, 0x74, 0x4b, 0x65, 0x75
  };

  public static String encrypt(String strToEncrypt)
  {
    try
    {
      // Prepare the algorithm
      Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
      final SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
      // Initialize the cipher for encryption
      cipher.init(Cipher.ENCRYPT_MODE, secretKey);
      // Base64.encodeBase64String gives an ASCII string
      final String encryptedString =
        Base64.encodeBase64String(cipher.doFinal(strToEncrypt.getBytes()));
      return encryptedString.replaceAll("\r|\n", "");
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
    return null;
  }

  public static String decrypt(String strToDecrypt)
  {
    try
    {
      // Prepare the algorithm
      Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
      final SecretKeySpec secretKey = new SecretKeySpec(key, "AES");
      // Initialize the cipher for decryption
      cipher.init(Cipher.DECRYPT_MODE, secretKey);
      final String decryptedString =
        new String(cipher.doFinal(Base64.decodeBase64(strToDecrypt)));
      return decryptedString;
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
    return null;
  }
}
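The round trip of the same AES/ECB/PKCS5Padding logic can be verified outside Hive with a minimal standalone sketch. The class below is an illustration, not part of the book's code: it uses the JDK's java.util.Base64 instead of commons-codec so it is self-contained, and the 16-byte demo key spells out the same ASCII bytes as the key array above.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;

// Minimal round-trip sketch of AES/ECB/PKCS5Padding, mirroring CipherUtils.
// The key is the ASCII rendering of the byte array shown above (16 bytes).
public class AesRoundTrip {
    private static final byte[] KEY = "uiis@sASeeritKeu".getBytes();

    public static String encrypt(String plain) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        // Base64-encode the raw cipher bytes so the result is printable text
        return Base64.getEncoder().encodeToString(cipher.doFinal(plain.getBytes()));
    }

    public static String decrypt(String enc) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        return new String(cipher.doFinal(Base64.getDecoder().decode(enc)));
    }

    public static void main(String[] args) throws Exception {
        String cipherText = encrypt("Will");
        // Decrypting the cipher text returns the original value
        System.out.println(decrypt(cipherText).equals("Will")); // prints true
    }
}
```

Note that ECB mode is used here only because the book's sample does so; for production use, a mode with an initialization vector (such as CBC or GCM) is generally preferred.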
Note
AES
Short for Advanced Encryption Standard, AES is a symmetric 128-bit block data encryption technique developed by Belgian cryptographers Joan Daemen and Vincent Rijmen. For more information, please refer to http://en.wikipedia.org/wiki/Advanced_Encryption_Standard.
To deploy the UDFs and verify them, do the following:
jdbc:hive2://> ADD JAR /home/dayongd/Downloads/
. . . . . . > hiveessentials-1.0-SNAPSHOT.jar;
No rows affected (0.002 seconds)

jdbc:hive2://> CREATE TEMPORARY FUNCTION aesdecrypt AS
. . . . . . > 'com.packtpub.hive.essentials.hiveudf.AESDecrypt';
No rows affected (0.02 seconds)

jdbc:hive2://> CREATE TEMPORARY FUNCTION aesencrypt AS
. . . . . . > 'com.packtpub.hive.essentials.hiveudf.AESEncrypt';
No rows affected (0.015 seconds)

jdbc:hive2://> SELECT aesencrypt('Will') AS encrypt_name
. . . . . . > FROM employee LIMIT 1;
+---------------------------+
|       encrypt_name        |
+---------------------------+
| YGvo54QIahpb+CVOwv9OkQ==  |
+---------------------------+
1 row selected (34.494 seconds)

jdbc:hive2://> SELECT aesdecrypt('YGvo54QIahpb+CVOwv9OkQ==')
. . . . . . > AS decrypt_name
. . . . . . > FROM employee LIMIT 1;
+---------------+
| decrypt_name  |
+---------------+
| Will          |
+---------------+
1 row selected (45.43 seconds)
Summary
In this chapter, we introduced three main areas of Hive security: authentication, authorization, and encryption. We covered authentication for the metastore server and HiveServer2. Then, we talked about the default, storage-based, and SQL standard-based authorization modes in HiveServer2. At the end of this chapter, we discussed the use of Hive UDFs for encryption and decryption. After going through this chapter, we should clearly understand the different areas that will help us address Hive security.
In the next chapter, we'll talk about using Hive with other tools.
Chapter 10. Working with Other Tools
As one of the earliest and most popular SQL-over-Hadoop tools, Hive has many use cases of working with other tools to offer an end-to-end data intelligence solution. In this chapter, we will discuss the way Hive works with other big data tools in the following areas:
The JDBC/ODBC connector
HBase
Hue
HCatalog
ZooKeeper
Oozie
The Hive roadmap
JDBC/ODBC connector
JDBC/ODBC is one of the most common ways for Hive to work with other tools. Hadoop vendors, such as Cloudera and Hortonworks, offer free Hive JDBC/ODBC drivers so that Hive can be connected through these drivers; they can be found at the following links:
For Cloudera, the link is http://www.cloudera.com/content/cloudera/en/downloads/connectors/hive.html
For Hortonworks, the link is http://hortonworks.com/hdp/addons/
We can use these JDBC/ODBC connectors to connect Hive to tools such as the following:
A command-line utility, such as Beeline, mentioned in Chapter 2, Setting Up the Hive Environment
An integrated development environment, such as Oracle SQL Developer, mentioned in Chapter 2, Setting Up the Hive Environment
Data extraction, transformation, loading, and integration tools, such as Talend Open Studio
Business intelligence reporting tools, such as JasperReports and QlikView
Data analysis tools, such as Microsoft Excel 2013
Data visualization tools, such as Tableau
Since the setup of the connectors is very straightforward, please refer to the websites of the preceding tools for more detailed instructions to connect to Hive.
HBase
HBase (see http://hbase.apache.org/) is a high-performance NoSQL key/value store on Hadoop. Hive offers a storage handler mechanism to integrate with HBase using the HBaseStorageHandler class, which creates HBase tables managed by Hive. By integrating Hive with HBase, Hive users can leverage the real-time transaction performance of HBase to do real-time big data analysis. Currently, the integration feature is still in progress, especially in the areas of higher performance and snapshot support. There is another project called Phoenix (see http://phoenix.apache.org/), which provides basic SQL with higher-performance support over HBase.
An example of creating an HBase table in HQL is as follows:
CREATE TABLE hbase_table_sample(
id int,
value1 string,
value2 string,
map_value map<string, string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf1:val,cf2:val,cf3:")
TBLPROPERTIES ("hbase.table.name" = "table_name_in_hbase");
In this special CREATE TABLE statement, the HBaseStorageHandler class delegates interaction with the HBase table to HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat. The hbase.columns.mapping property is required to map each table column defined in the statement to the HBase table columns in order. For example, the ID, by order, maps to the HBase table's row key as :key. Sometimes, we may need to generate the proper row key columns using Hive UDFs if there is no existing column that can be used as a row key for the HBase table. The value1 column maps to the val column in the cf1 column family in the HBase table. The Hive MAP data type can be used to access an entire column family. Each row can have a different set of columns, where the column names correspond to the map keys and the column values correspond to the map values, such as the map_value columns. The hbase.table.name property, which is optional, specifies the table name known by HBase. If it is not provided, the Hive and HBase tables will have the same name, such as hbase_table_sample.
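Once defined, the mapped table can be written to and read through Hive like any other table. The following is a minimal sketch (the values are made up, and the employee table from earlier chapters is assumed to exist just to drive the row-producing SELECT):

```sql
-- Write one row through Hive; it lands in the mapped HBase table
INSERT INTO TABLE hbase_table_sample
SELECT 1, 'v1', 'v2', map('q1', 'mv1') FROM employee LIMIT 1;

-- Read it back; the MAP column exposes the whole cf3 column family,
-- so individual qualifiers can be addressed by map key
SELECT id, value1, map_value['q1'] FROM hbase_table_sample;
```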
Note
For more information about configurations and features in progress for Hive-HBase integration, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.
Hue
Hue (see http://gethue.com/) is short for Hadoop User Experience. It is a web interface for making the Hadoop ecosystem easier to use. For Hive users, Hue offers a unified web interface for easily accessing both HDFS and Hive in an interactive environment. Hue can be installed alone or with the Hadoop vendor packages. In addition, Hue adds more programming-friendly features to Hive, such as the following:
Highlights HQL keywords
Autocompletes HQL queries
Offers live progress and logs for Hive and MapReduce jobs
Submits several queries and checks progress later
Browses data in Hive tables through a web user interface
Navigates through the metadata
Registers UDFs and adds files/archives through a web user interface
Saves, exports, and shares the query result
Creates various charts from the query result
The following is a screenshot of the Hive editor interface in Hue:
Hue Hive editor user interface
HCatalog
HCatalog (see https://cwiki.apache.org/confluence/display/Hive/HCatalog) is a metadata management system for Hadoop data. It stores consistent schema information for Hadoop ecosystem tools, such as Pig, Hive, and MapReduce. By default, HCatalog supports data in the RCFile, CSV, JSON, SequenceFile, and ORC file formats, as well as a customized format if InputFormat, OutputFormat, and SerDe are implemented. By using HCatalog, users are able to directly create, edit, and expose (via its REST API) metadata, which becomes effective immediately in all tools sharing the same piece of metadata. At first, HCatalog was a separate Apache project from Hive and was part of the Apache Incubator, where most Apache projects first start. Eventually, HCatalog became a part of the Hive project in 2013, starting with Hive 0.11.0.
HCatalog is built on top of the Hive metastore and incorporates support for Hive DDL. It provides read and write interfaces, HCatLoader and HCatStorer, for Pig by implementing Pig's load and store interfaces, respectively. HCatalog also provides an interface for MapReduce programs by using HCatInputFormat and HCatOutputFormat, which are very similar to other customized formats, by implementing Hadoop's InputFormat and OutputFormat. HCatalog provides a REST API from a component called WebHCat so that HTTP requests can be made to access the metadata of Hadoop MapReduce/Yarn, Pig, Hive, and HCatalog DDL from other applications. There is no Hive-specific interface since HCatalog uses Hive's metastore. Therefore, HCatalog can define metadata for Hive directly through its CLI. The HCatalog CLI supports the HQL SHOW/DESCRIBE statements and the majority of Hive DDL, except the following statements, which require running MapReduce jobs:
CREATE TABLE … AS SELECT
ALTER INDEX … REBUILD
ALTER TABLE … CONCATENATE
ALTER TABLE ARCHIVE/UNARCHIVE PARTITION
ANALYZE TABLE … COMPUTE STATISTICS
IMPORT/EXPORT
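Because the HCatalog CLI accepts the same DDL as Hive, a schema defined once is immediately visible to Hive, Pig (via HCatLoader/HCatStorer), and MapReduce. A sketch with a hypothetical table, which could typically be run through the CLI as hcat -e "<ddl>", follows:

```sql
-- Define the schema once through HCatalog; Hive, Pig, and MapReduce
-- then all share the same table definition
CREATE TABLE web_logs (
  ip STRING,
  ts TIMESTAMP,
  url STRING
)
PARTITIONED BY (log_date STRING)
STORED AS ORC;

-- The HCatalog CLI also supports SHOW/DESCRIBE
DESCRIBE web_logs;
```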
ZooKeeper
ZooKeeper (see http://zookeeper.apache.org/) is a centralized service for configuration management and the synchronization of various aspects of naming and coordination. It manages a naming registry and effectively implements a system for managing the various statically and dynamically named objects in a hierarchical system. It also enables coordination of and control over shared resources, such as files and data, which are manipulated by multiple concurrent processes.
Unlike RDBMS, Hive does not natively support concurrent access and locking mechanisms. Hive has relied on ZooKeeper for locking shared resources since Hive 0.7.0. There are two types of locks provided by Hive through ZooKeeper, and they are as follows:
Shared lock: This is acquired when a table/partition is read. Concurrent shared locks are allowed in Hive.
Exclusive lock: This is acquired for all other operations that modify the table. For partitioned tables, only a shared lock is acquired if the change is only applicable to the newly created partitions. An exclusive lock is acquired on the table if the change is applicable to all partitions. In addition, an exclusive lock on the table globally affects all partitions.
Any HQL statement must acquire the proper locks before being allowed to perform the corresponding lock-permitted operations.
To enable locking in Hive, we need to make sure ZooKeeper is installed and configured. Then, configure the following properties in Hive's hive-site.xml file:
<property>
<name>hive.support.concurrency</name>
<description>Enable Hive's Table Lock Manager Service</description>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<description>Comma-separated ZooKeeper quorum used by Hive's Table Lock
Manager.</description>
<value>localhost.localdomain</value>
</property>
We can also set the following property to use the new lock manager for transaction support since Hive 0.13.0:
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
Note
Once configured, we can further set locking properties, specified and detailed at https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Locking.
Locks are either implicitly acquired/released from HQL or explicitly acquired/released using the LOCK and UNLOCK statements, as follows:
-- Lock the table and specify the lock type
jdbc:hive2://> LOCK TABLE employee SHARED;
No rows affected (1.328 seconds)

-- Show the lock information on the specific table
jdbc:hive2://> SHOW LOCKS employee EXTENDED;
+------------------------------------------------------------------------+--------+
|                                tab_name                                |  mode  |
+------------------------------------------------------------------------+--------+
| default@employee                                                       | SHARED |
| LOCK_QUERYID: hive_20150105170303_792598b1-0ac8-4aad-aa4e-c4cdb0de6697 |        |
| LOCK_TIME: 1420495466554                                               |        |
| LOCK_MODE: EXPLICIT                                                    |        |
| LOCK_QUERYSTRING: LOCK TABLE employee shared                           |        |
+------------------------------------------------------------------------+--------+
5 rows selected (0.576 seconds)

-- Release the lock on the table
jdbc:hive2://> UNLOCK TABLE employee;
No rows affected (0.209 seconds)

-- Show all locks in the database
jdbc:hive2://> SHOW LOCKS;
+-----------+-------+
| tab_name  | mode  |
+-----------+-------+
+-----------+-------+
No rows selected (0.529 seconds)

jdbc:hive2://> LOCK TABLE employee EXCLUSIVE;
No rows affected (0.185 seconds)

jdbc:hive2://> SHOW LOCKS employee EXTENDED;
+------------------------------------------------------------------------+-----------+
|                                tab_name                                |   mode    |
+------------------------------------------------------------------------+-----------+
| default@employee                                                       | EXCLUSIVE |
| LOCK_QUERYID: hive_20150105170808_bbc6db18-e44a-49a1-bdda-3dc30b5c8cee |           |
| LOCK_TIME: 1420495807855                                               |           |
| LOCK_MODE: EXPLICIT                                                    |           |
| LOCK_QUERYSTRING: LOCK TABLE employee exclusive                        |           |
+------------------------------------------------------------------------+-----------+
5 rows selected (0.578 seconds)

jdbc:hive2://> SELECT * FROM employee;
When the table holds an exclusive lock, the preceding SELECT statement will wait for the lock and show nothing as a result set unless we unlock the table in the other session. From the Hive log, we can find the following information, which specifies that the SELECT statement is waiting to get the read lock:
15/01/05 17:13:39 INFO ql.Driver: <PERFLOG method=acquireReadWriteLocks>
15/01/05 17:13:39 ERROR ZooKeeperHiveLockManager: conflicting lock present for default@employee mode SHARED
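If waiting indefinitely on a conflicting lock is undesirable, Hive exposes retry-related properties that bound how long a statement keeps trying to acquire a lock. The values below are only illustrative; check the locking configuration page referenced in the following note for the current property list and defaults:

```xml
<property>
<name>hive.lock.numretries</name>
<value>10</value>
<description>The number of times to retry acquiring a lock.</description>
</property>
<property>
<name>hive.lock.sleep.between.retries</name>
<value>60</value>
<description>Seconds to sleep between lock acquisition retries.</description>
</property>
```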
Note
For more information about using ZooKeeper for Hive locks, please refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/Locking.
Oozie
Oozie (see http://oozie.apache.org/) is an open source workflow coordination and schedule service to manage data processing jobs. Oozie workflow jobs are defined as a series of nodes in a Directed Acyclical Graph (DAG). Acyclical here means that there are no loops in the graph, and all nodes in the graph flow in one direction without going back. Oozie workflows contain either control flow nodes or action nodes:
Control flow node: This either defines the start, end, and failed node in a workflow, or controls the workflow execution path, such as the decision, fork, and join nodes.
Action node: This defines the core data processing action job, such as MapReduce, Hadoop filesystem, Hive, Pig, Java, Shell, e-mail, and Oozie subworkflows. Additional types of actions are also supported by developing extensions.
Oozie is a scalable, reliable, and extensible system. It can be parameterized for workflow submission and scheduled to run automatically. Therefore, Oozie is very suitable for lightweight data integration or maintenance jobs.
Hue offers very friendly and powerful support for Oozie through the Oozie editor. Creating and submitting an Oozie workflow of Hive actions from Hue is as straightforward as the following steps:
1. Log in to Hue and select Workflows | Editors | Workflows from the top menu bar to open Workflow Manager.
2. Click on the Create button to create a workflow.
3. Give a proper workflow name and save the workflow.
4. Once the workflow is saved, the Oozie editor window appears for further settings.
5. Drag a Hive action to the middle of the start and end nodes.
6. In the Edit Node: menu shown, the following settings are present. Provide proper settings as follows:
Name: Give a proper action name.
Description: This is where to describe the job. This is optional.
Advanced: This is for SLA monitoring. This is optional.
Script name: Choose the HQL scripts from HDFS for the Hive action.
Prepare: Define actions, such as deleting files or creating folders, before running the script. This is optional.
Parameters: This defines the parameters to be taken when submitting the job (such as ${date}). This is optional.
Job properties: This is where to set Hadoop/Hive properties. This is optional.
Files: This is where to select the files needed for the scripts. This is optional.
Archives: This is where to select the archive files, such as UDF JARs. This is optional.
Job XML: Choose a copy of the hive-site.xml file of the Hive cluster from HDFS so that Oozie can connect to the Hive metastore.
7. Click on Done in the Edit Node: menu, and then click on Save in Workflow Editor.
8. Click on Submit to submit the workflow. Then, the Hive action is triggered by the Oozie workflow successfully.
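Under the hood, the workflow Hue saves is simply an XML definition. A minimal hand-written equivalent with a single Hive action might look like the following sketch; the schema versions are common ones, and the paths, parameters, and node names are placeholders:

```xml
<workflow-app name="hive-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="hive-node"/>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- Copy of hive-site.xml on HDFS so Oozie can reach the metastore -->
      <job-xml>/user/hue/hive-site.xml</job-xml>
      <script>/user/hue/scripts/daily_report.hql</script>
      <param>date=${date}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```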
Hive roadmap
As this is the end of this chapter as well as of this book, the highlights of each Hive release milestone and the features expected in the future are summarized as follows, along with best wishes to the Hive communities for growing bigger and better in the near future:
December 2011 – Hive 0.8.0
Added Bitmap indexes
Added the TIMESTAMP data type
Added the Hive Plugin Developer Kit to make plugin building and testing easier
Improved the JDBC Driver and bug fixes
April 2012 – Hive 0.9.0
Added the CREATE OR REPLACE VIEW statement
Added NOT IN and NOT LIKE support
Added the BETWEEN and NULL-safe equality operators
Added the printf(), sort_array(), and concat_ws() functions
Added a filter push-down from Hive into HBase for the key column
Combined multiple UNION ALL statements in one MapReduce job
Combined multiple GROUP BY statements on the same data with the same keys in one MapReduce job
January 2013 – Hive 0.10.0
Added the CUBE and ROLLUP statements
Added better support for YARN
Added more information to the EXPLAIN statement
Added the SHOW CREATE TABLE statement
Added built-in support for reading/writing Avro data
Added improvements for skewed joins
Improved the speed of simple queries by not running MapReduce jobs
May 2013 – Hive 0.11.0 as Stinger Phase 1
Added ORC for better performance
Added analytic and windowing functions
Added HCatalog as part of Hive
Added GROUP BY column positions
Improved data types and added the DECIMAL data type
Improved joins for broadcast and SMB joins
Implemented HiveServer2
October 2013 – Hive 0.12.0 as Stinger Phase 2
Added VARCHAR and DATE support
Added parallel ORDER BY to Hive
Added more improvements for ORC, such as predicate push-down
Added a correlation optimizer
Added support for GROUP BY on the STRUCT type
Added support for the outer lateral view
Pushed LIMIT down to mappers
April 2014 – Hive 0.13.0 as Stinger Phase 3 Final
Added the DECIMAL and CHAR data types
Added support for running jobs on Tez
Added a vectorized query engine
Added support for subqueries with IN, NOT IN, EXISTS, and NOT EXISTS
Added support for permanent functions
Added support for common table expressions
Added SQL standard-based authorization
November 2014 – Hive 0.14.0 as Stinger.next Phase 1
Added transactions with ACID semantics
Added a Cost-Based Optimizer (CBO)
Added the CREATE TEMPORARY TABLE statement
Added support for STORED AS AVRO in the CREATE TABLE statement
Added the skipTrash configuration for the DROP TABLE statement
Added AccumuloStorageHandler
Used Tez auto-parallelism in Hive
February 2015 – Hive 1.0.0
Moved to a 1.x.y release naming structure
Made HiveMetaStoreClient a public API
Removed HiveServer1
Switched to Tez 0.5.2
Future
Offer subsecond queries with Live Long And Process (LLAP)
Offer Hive over Spark
Support SQL:2011 analytics
Support cross-geo queries
Offer materialized views
Offer workload management via YARN and LLAP integration
Make Hive a unified data query tool
Summary
In this final chapter, we introduced big data tools that can work with Hive, including the JDBC/ODBC connectors, HBase, Hue, HCatalog, ZooKeeper, and Oozie. Then, we reviewed the key releases of Hive from 0.8.0 to 1.0.0, as well as the exciting features expected in the future. After going through this chapter, we should understand how to use other big data tools with Hive to provide end-to-end data intelligence solutions.
Index
A
Abstract syntax tree (AST)
  about / The EXPLAIN statement
ACLs on HDFS
  URL / Storage-based mode
Advanced Encryption Standard (AES)
  URL / Encryption
aggregate functions / Operators and functions
aggregation
  data aggregation / Basic aggregation – GROUP BY
  without GROUP BY columns / Basic aggregation – GROUP BY
  with GROUP BY columns / Basic aggregation – GROUP BY
  advanced / Advanced aggregation – GROUPING SETS, Advanced aggregation – ROLLUP and CUBE
  ROLLUP statement / Advanced aggregation – ROLLUP and CUBE
  CUBE statement / Advanced aggregation – ROLLUP and CUBE
  condition, HAVING statement / Aggregation condition – HAVING
Amazon EMR
  URL / Starting Hive in the cloud
analytic functions
  about / Analytic functions
  Function (arg1, …, argn) / Analytic functions
  Standard aggregations / Analytic functions
  RANK / Analytic functions
  DENSE_RANK / Analytic functions
  ROW_NUMBER / Analytic functions
  CUME_DIST / Analytic functions
  PERCENT_RANK / Analytic functions
  NTILE / Analytic functions
  LEAD function / Analytic functions
  LAG function / Analytic functions
  FIRST_VALUE / Analytic functions
  LAST_VALUE / Analytic functions
  window expressions / Analytic functions
ANALYZE statement
  about / The ANALYZE statement
ANTLR
  URL / The EXPLAIN statement
Apache
  used, for installing Hive / Installing Hive from Apache
Apache Hive
  Wiki, URL / Using the Hive command line and Beeline
Apache Hive Wiki
  URL / HBase
Apache JIRA Hive-365
  URL / Understanding Hive data types
Atomicity, Consistency, Isolation, and Durability (ACID)
  about / Transactions
authentication
  about / Authentication
  Metastore server authentication / Metastore server authentication
  HiveServer2 authentication / HiveServer2 authentication
authorization
  about / Authorization
  legacy mode / Legacy mode
  storage-based mode / Storage-based mode
  SQL standard-based mode / SQL standard-based mode
Avro
  URL / SerDe
AvroSerDe / SerDe
Azure HDInsight Service
  URL / Starting Hive in the cloud
B
batch processing
  about / Batch, real-time, and stream processing
Beeline
  using / Using the Hive command line and Beeline
  URL / Using the Hive command line and Beeline
  command-line syntax / Using the Hive command line and Beeline
big data
  about / Introducing big data
  volume / Introducing big data
  velocity / Introducing big data
  variety / Introducing big data
  veracity / Introducing big data
  variability / Introducing big data
  volatility / Introducing big data
  visualization / Introducing big data
  value / Introducing big data
block sampling / Sampling
bucket map join / Bucket map join
buckets
  about / Hive buckets
  number / Hive buckets
bucket tables
  about / Bucket tables
bucket table sampling / Sampling
C
cloud
  Hive, starting / Starting Hive in the cloud
Cloudera
  URL / Starting Hive in the cloud
  about / JDBC/ODBC connector
Cloudera Distributed Hadoop (CDH)
  URL / Installing Hive from vendor packages
CLUSTER BY / ORDER and SORT
collection functions / Operators and functions
collection item delimiter / Understanding Hive data types
ColumnarSerDe / SerDe
CombineFileInputFormat / Storage optimization
common join, join optimization / Common join
Common Table Expression (CTE) / Hive internal and external tables
compression / Compression
conditional functions / Operators and functions
Cost-Based Optimizer (CBO)
  about / The ANALYZE statement
Cost Base Optimizer (CBO) / Hive roadmap
CREATE TABLE / Hive internal and external tables
Create the table as select (CTAS) / Hive internal and external tables
CROSS JOIN statement / The OUTER JOIN and CROSS JOIN statements
CUBE statement
  about / Advanced aggregation – ROLLUP and CUBE
D
data aggregation
  about / Basic aggregation – GROUP BY
database, Hive
  about / Hive database
data exchange
  LOAD keyword / Data exchange – LOAD
  INSERT keyword / Data exchange – INSERT
  EXPORT statement / Data exchange – EXPORT and IMPORT
  IMPORT statement / Data exchange – EXPORT and IMPORT
data file optimization
  about / Data file optimization
  file format / File format
  compression / Compression
  storage optimization / Storage optimization
data type conversions
  about / Data type conversions
  primitive type conversion / Data type conversions
  explicit type conversion / Data type conversions
data type functions tips, complex / Operators and functions
data types, Hive
  about / Understanding Hive data types
  TINYINT / Understanding Hive data types
  SMALLINT / Understanding Hive data types
  INT / Understanding Hive data types
  BIGINT / Understanding Hive data types
  FLOAT / Understanding Hive data types
  DOUBLE / Understanding Hive data types
  DECIMAL / Understanding Hive data types
  BINARY / Understanding Hive data types
  BOOLEAN / Understanding Hive data types
  STRING / Understanding Hive data types
  CHAR / Understanding Hive data types
  VARCHAR / Understanding Hive data types
  DATE / Understanding Hive data types
  TIMESTAMP / Understanding Hive data types
date functions / Operators and functions
date function tips / Operators and functions
delimiters
  row delimiter / Understanding Hive data types
  collection item delimiter / Understanding Hive data types
  map key delimiter / Understanding Hive data types
deployment / Development and deployment
Derby
  URL / Installing Hive from Apache
design optimization
  about / Design optimization
  partition tables / Partition tables
  bucket tables / Bucket tables
  index / Index
development / Development and deployment
Directed Acyclical Graph (DAG) / Oozie
directed acyclic graphs (DAGs) / Index
DISTRIBUTE BY / ORDER and SORT
E
encryption
  about / Encryption
EXPLAIN statement
  about / The EXPLAIN statement
  EXTENDED keyword / The EXPLAIN statement
  DEPENDENCY keyword / The EXPLAIN statement
  AUTHORIZATION keyword / The EXPLAIN statement
explicit type conversion / Data type conversions
EXPORT statement / Data exchange – EXPORT and IMPORT
external tables
  about / Hive internal and external tables
F
file format, data file optimization
  about / File format
  TEXTFILE / File format
  SEQUENCEFILE / File format
  RCFILE / File format
  Optimized Row Columnar (ORC) / File format
  PARQUET / File format
Flume / Overview of the Hadoop ecosystem
functions
  about / Operators and functions
  mathematical functions / Operators and functions
  collection functions / Operators and functions
  type conversion functions / Operators and functions
  date functions / Operators and functions
  conditional functions / Operators and functions
  string functions / Operators and functions
  aggregate functions / Operators and functions
  table-generating functions / Operators and functions
  customized / Operators and functions
  complex data type functions tips / Operators and functions
  date function tips / Operators and functions
  CASE, for data types / Operators and functions
  parser and search tips / Operators and functions
  virtual columns / Operators and functions
G
GenericUDAF
  URL / The UDAF code template
GROUPING SETS keyword
  about / Advanced aggregation – GROUPING SETS
H
Hadoop
  versus relational database / Relational and NoSQL database versus Hadoop
  versus NoSQL database / Relational and NoSQL database versus Hadoop
Hadoop Archive and HAR / Storage optimization
Hadoop Archive File (HAR) / File format
Hadoop ecosystem
  about / Overview of the Hadoop ecosystem
HAVING statement
  about / Aggregation condition – HAVING
HBase
  about / HBase
  URL / HBase
  table, creating in HQL / HBase
HBaseSerDe / SerDe
HCatalog
  about / HCatalog
  URL / HCatalog
HDFS
  about / Batch, real-time, and stream processing, Overview of the Hadoop ecosystem
HDFS federation / Storage optimization
Hive
  about / Hive overview
  installing, from Apache / Installing Hive from Apache
  URL / Installing Hive from Apache
  installing, from vendor packages / Installing Hive from vendor packages
  starting, in cloud / Starting Hive in the cloud
  data types / Understanding Hive data types
  complex types / Understanding Hive data types
  types / Understanding Hive data types
  database / Hive database
  internal tables / Hive internal and external tables
  external tables / Hive internal and external tables
  partitions / Hive partitions
  buckets / Hive buckets
  views / Hive views
  performance utilities / Performance utilities
Hive, complex types
  ARRAY / Understanding Hive data types
  MAP / Understanding Hive data types
  STRUCT / Understanding Hive data types
  NAMED STRUCT / Understanding Hive data types
  UNION / Understanding Hive data types
Hive-integrated development environment (IDE)
  about / The Hive-integrated development environment
hive.map.aggr property / Basic aggregation – GROUP BY
Hive CLI
  command-line syntax / Using the Hive command line and Beeline
  URL / Using the Hive command line and Beeline
Hive command line
  using / Using the Hive command line and Beeline
Hive Data Definition Language (DDL)
  about / Hive Data Definition Language
Hive join optimization
  URL / Skew join
Hive roadmap
  about / Hive roadmap
HiveServer2
  URL / Using the Hive command line and Beeline
HiveServer2 authentication
  none authentication / HiveServer2 authentication
  Kerberos authentication / HiveServer2 authentication
  LDAP authentication / HiveServer2 authentication
  pluggable custom authentication / HiveServer2 authentication
  Pluggable Authentication Modules (PAM) authentication / HiveServer2 authentication
Hive Wiki
  URL / Operators and functions
Hortonworks
  URL / JDBC/ODBC connector
HQL
  about / Hive overview
Hue
  URL / The Hive-integrated development environment, Hue
  about / Hue
I
Impala
  URL / A short history
IMPORT statement / Data exchange – EXPORT and IMPORT
index
  about / Index
INNER JOIN statement / The INNER JOIN statement
INSERT keyword / Data exchange – INSERT
internal tables
  about / Hive internal and external tables
J
Java IDE
  URL / Development and deployment
Java Virtual Machine (JVM) / Batch, real-time, and stream processing
javax.script API
  URL / User-defined functions
JDBC/ODBC connector
  about / JDBC/ODBC connector
job and query optimization
  about / Job and query optimization
  local mode / Local mode
  JVM reuse / JVM reuse
  parallel execution / Parallel execution
join optimization
  about / Join optimization
  common join / Common join
  map join / Map join
  bucket map join / Bucket map join
  Sort merge bucket (SMB) join / Sort merge bucket (SMB) join
  Sort merge bucket map (SMBM) join / Sort merge bucket map (SMBM) join
  skew join / Skew join
JSONSerDe
  URL / SerDe
  about / SerDe
JVM reuse, job and query optimization / JVM reuse
K
Kerberos
  about / Authentication
Kerberos authentication / HiveServer2 authentication
Key Distribution Center (KDC) / Authentication
L
LazySimpleSerDe / SerDe
LDAP authentication / HiveServer2 authentication
legacy mode, authorization
  about / Legacy mode
Live Long And Process (LLAP) / Hive roadmap
LOAD keyword / Data exchange – LOAD
local mode, job and query optimization / Local mode
M
map join, join optimization / Map join
MAPJOIN statement / Special JOIN – MAPJOIN
map key delimiter / Understanding Hive data types
mathematical functions / Operators and functions
Maven
  URL / Development and deployment
metastore / Hive overview
Metastore server authentication
  about / Metastore server authentication
MIT Kerberos
  URL / Authentication
MySQL
  URL / Installing Hive from Apache
N
none authentication / HiveServer2 authentication
NoSQL database
  versus Hadoop / Relational and NoSQL database versus Hadoop
O
Oozie
  about / Oozie
  URL / Oozie
  control flow node / Oozie
  action node / Oozie
OpenCSVSerDe / SerDe
operators
  about / Operators and functions
Optimized Row Columnar (ORC) / Index, File format
Optimized Row Columnar (ORC) file
  about / Transactions
ORDER BY (ASC|DESC) keyword / ORDER and SORT
ORDER keyword / ORDER and SORT
OUTER JOIN statement / The OUTER JOIN and CROSS JOIN statements
OutOfMemory (OOM) exceptions / The INNER JOIN statement
P
parallel execution, job and query optimization / Parallel execution
ParquetHiveSerDe / SerDe
parser and search tips / Operators and functions
PARTITION BY statement / Analytic functions
partitions
  about / Hive partitions
partition tables
  by date and time / Partition tables
  by locations / Partition tables
  by business logics / Partition tables
personal identity information (PII)
  about / Encryption
Phoenix
  URL / HBase
Pluggable Authentication Modules (PAM) authentication / HiveServer2 authentication
pluggable custom authentication / HiveServer2 authentication
PostgreSQL
  URL / Installing Hive from Apache
Presto
  URL / A short history
primitive type conversion / Data type conversions
Processing Elements (PE) / Batch, real-time, and stream processing
R
random sampling
  URL / Sampling
real-time processing
  about / Batch, real-time, and stream processing
Record Columnar File (RCFILE) / File format
RegexSerDe / SerDe
relational database
  versus Hadoop / Relational and NoSQL database versus Hadoop
ROLLUP statement
  about / Advanced aggregation – ROLLUP and CUBE
row delimiter / Understanding Hive data types
S
sampling
  about / Sampling
  random sampling / Sampling
  bucket table sampling / Sampling
  block sampling / Sampling
SELECT * statement / The SELECT statement
SELECT statement / The SELECT statement
Sentry
  URL / SQL standard-based mode
SequenceFile format / Storage optimization
SerDe
  about / SerDe
  data, reading / SerDe
  data, writing / SerDe
  LazySimpleSerDe / SerDe
  ColumnarSerDe / SerDe
  RegexSerDe / SerDe
  HBaseSerDe / SerDe
  AvroSerDe / SerDe
  ParquetHiveSerDe / SerDe
  OpenCSVSerDe / SerDe
  JSONSerDe / SerDe
SHOW TRANSACTIONS command / Transactions
Simple Authentication and Security Layer (SASL) framework / Metastore server authentication
skew join / Skew join
SORT BY (ASC|DESC) keyword / ORDER and SORT
SORT keyword / ORDER and SORT
sort merge bucket (SMB) join / Sort merge bucket (SMB) join
sort merge bucket map (SMBM) join / Sort merge bucket map (SMBM) join
Spark / Overview of the Hadoop ecosystem
SQLLine
  URL / Using the Hive command line and Beeline
SQL standard-based mode, authorization
  about / SQL standard-based mode
Sqoop / Overview of the Hadoop ecosystem
stage dependencies
  about / The EXPLAIN statement
stage plans
  about / The EXPLAIN statement
storage-based mode, authorization
  about / Storage-based mode
storage optimization / Storage optimization
Storm
  URL / A short history, Batch, real-time, and stream processing
streaming
  about / Streaming
stream processing
  about / Batch, real-time, and stream processing
string functions / Operators and functions
Structured Query Language (SQL)
  about / A short history
T
table-generating functions / Operators and functions
Tez / Overview of the Hadoop ecosystem
  about / Index
  URL / Index
transactions
  about / Transactions
type conversion functions / Operators and functions
U
UDAF
  code, template / The UDAF code template
UDAFs
  about / User-defined functions
UDF
  code, template / The UDF code template
UDFs
  about / User-defined functions
UDTF
  code, template / The UDTF code template
UDTFs
  about / User-defined functions
Uniform Resource Identifier (URI) / Data exchange – LOAD
UNION ALL statement / Set operation – UNION ALL
V
value / Introducing big data
variability / Introducing big data
variety / Introducing big data
Vectorization optimization
  about / Index
  URL / Index
velocity / Introducing big data
vendor packages
  used, for installing Hive / Installing Hive from vendor packages
veracity / Introducing big data
views
  about / Hive views
  altering / Hive views
  redefining / Hive views
  dropping / Hive views
virtual columns / Operators and functions
visualization / Introducing big data
volatility / Introducing big data
volume / Introducing big data
W
WHERE clauses
  subqueries, restrictions / The SELECT statement
window expressions
  BETWEEN … AND clause / Analytic functions
  N PRECEDING or FOLLOWING / Analytic functions
  UNBOUNDED PRECEDING / Analytic functions
  UNBOUNDED FOLLOWING / Analytic functions
  UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING / Analytic functions
  CURRENT ROW / Analytic functions
  URL / Analytic functions
Z
ZooKeeper
  about / ZooKeeper
  URL / ZooKeeper
  shared lock / ZooKeeper
  exclusive lock / ZooKeeper
  for Hive locks, URL / ZooKeeper