big data with apache hadoop
TRANSCRIPT
Ing.AlessandroBruno– Ph.D.Research Fellow@CVIPGroup
PalermoUniversity
BIGDataAnalyticswithHadoop
ADI- FreeSoftware&ScientificResearch Session
Palermo24-26th June 2016
BigDataAnalytics
BigDataAnalytics
l“Inpioneerdaystheyusedoxenforheavypulling,andwhenoneoxcouldn’tbudgealog,theydidn’ttrytogrowalargerox.Weshouldbetryingforbiggercomputers,butformoresystemsofcomputers”.GraceHopper
BigDataAnalyticsl BigDataisdrivingradicalchangesintraditionaldataanalysisplatforms
l Performinganykindofanalysisinvoluminousandcomplexdata
l Scalingupthehardwareplatformsbecomesimminentandchoosingtherighthardware/softwareplatformsbecomesacrucialdecision
l Satisfyrequestinreasonableamountoftime
[link] [link] [link]
BigDataAnalytics
l Theabilityofbigdataplatformstoadpattoincreaseddataprocessingdemandsplaysacriticalrole:Isitappropriatetobuildtheanalyticsbasedsolutionsonaparticularplatform?
l Whatarethefundamentalissuesbeforemakingtherightdecisions?
FundamentalIssuesl Howquicklydoweneedtogettheresults?
l HowBigistheDatatobeprocessed?
l Doesthemodelbuildingrequireseveraliterationsor asingleiteration?
l Willtherebeaneedformoredataprocessingcapabilityinthefuture?
l Isdatatransfercriticalforthisapplication?
l Isthereanyneedforhandlinghardwarefailureswithintheapplications?
SomeexamplesofBigDatal 1.06billionmonthlyactiveusersonfacebookand680millionmobileusers
l Onanaverage,3.2billionlikesandcommentsarepostedeverydayonFacebook.
l 72%ofwebaudienceisonFacebook.
l FacebookstartedusingHadoopinmid-2009andwasoneoftheinitialusersofHadoop
l Facebookisgenerating500+terabytesofdataperday
SomeexamplesofBigDatal NYSE(NewYorkStockExchange)generatesabout1terabyteofnewtradedataperday,ajetairlinecollects10terabytes ofcensordataforevery 30minutes offlyingtime
l Googleprocessesover40,000searchquerieseverysecondonaverage,whichtranslatestoover3.5billionsearchesperdayand1.2trillionsearchesperyearworldwide
VforBIGDATA
Volume Variety Velocity Veracity Value
VforBigDatal Volume,Velocity,Variety,Veracity,Value
l Volumereferstothevastamountsofdatageneratedeverysecond.Justthinkofalltheemails,twittermessages,photos,videoclips,sensordataetc.weproduceandshareeverysecond
l Velocityreferstothespeedatwhichnewdataisgeneratedandthespeedatwhichdatamovesaround.Justthinkofsocialmediamessagesgoingviralinseconds,thespeedatwhichcreditcardtransactionsarecheckedforfraudulentactivities
VforBigDatal Varietyreferstothedifferenttypesofdatawecannowuse.Inthepastwefocusedonstructureddatathatneatlyfitsintotablesorrelationaldatabases,suchasfinancialdata(e.g.salesbyproductorregion).Infact,80%oftheworld’sdataisnowunstructured,andthereforecan’teasilybeputintotables(thinkofphotos,videosequencesorsocialmediaupdates)
l Veracityreferstothemessinessortrustworthinessofthedata.Withmanyformsofbigdata,qualityandaccuracyarelesscontrollable(justthinkofTwitterpostswithhashtags,abbreviations,typosandcolloquialspeechaswellasthereliabilityandaccuracyofcontent)
VforBigDatal Value: ThereisstillanotherVtotakeintoaccountwhenlookingatBigData:Value!Youarewellandgoodhavingaccesstobigdatabutunlesswecanturnitintovalueitisuseless.Soyoucansafelyarguethat'value'isthemostimportantVofBigData
Scalingl Scalingistheabilityofthesystemtoadapttoincreaseddemandsintermsofdataprocessing
l HORIZONTALSCALING(SCALEOUT) :Itinvolvesdistributing theworkloadacrossmanyserverswhichmaybecommoditymachines.Furthermore, multipleindependentmachinesareaddedtogetherinorder toimproveprocessingcapability
l VERTICALSCALING(SCALEUP):Itinvolvesinstallingmoreprocessors,morememoryandfasterhardware,generallywithinasingleserver
Scalingl DrawbacksandAdvantagesofHorizontalandVerticalScaling
HorizontalScalingl Scalingoutplatformsincludespeer-to-peernetworksandApacheHadoop
l Peertopeernetworksinvolvemillionsofmachinesconnectedinanetwork:decentralizedanddistributednetworkarchitecturewherethenodesinthenetworksserveasconsumeresources
l Bottleneckinpeertopeernetworksarisesinthecommunicationbetweendifferentnodes
l FeatureofMPI(MessagePassingInterface):statepreservingprocess;processcanliveaslongasthesystemrunsandthereisnoneedtoreadagainthesamedata(suchasinthecaseofMAPREDUCEscheme)
HorizontalScalingl InMPIarchitecturealltheparameterscanbepreservedlocally.MPIiswellsuitedforiterativeprocessing.ItisemployedbytheMaster/Slaveparadigm.SlavenodecanbecometheMasterforotherprocesses.
l DrawbacksofMPI:Thefaultintolerance,MPIhasnomechanismtohandlefaults…Asinglenodefailurecancausetheentiresystemshutdown
l WiththeadventofotherframeworkssuchasHadoop,MPIisnotbeingwidelyusedanymore…
HorizontalScaling- ApacheHadoopl ApacheHadoopisanopensourceframeworkforstoringandprocessinglargedatasetsusingclustersofcommodityhardware
l Scaleuptohundredsandeventhousandsofnodes
l Highlyfaulttolerant
HADOOPStorageandAnalysisat
InternetScale
Hadoop- DATAl Weliveinthedataage:
l NewYorkStockExchangegeneratesaboutoneterabyteofnewtradedataperday
l Facebookhostsapproximately10billionsphotos
l Ancestry.com(genealogysite)storedabout2.5petabytesofdata
l TheinternetArchivestoresaround2petabytesofdata…andisgrowingatarateof20terabytespermonth
THEGOODNEWSISTHATBIGDATAISHERE.THEBADNEWSISTHATWEARESTRUGGLINGTOSTOREANDANALYZEIT.
Hadoop– BIGDATAl ThefirstproblemrelativetoDataStorageandAnalysisistosolvehardwarefailureassoonasyoustartusingmanypiecesofhardware,thechancethatonewillfailisfairlyhigh…
l Thesecondproblemisthatmostanalysistasksneedtobeabletocombinethedatainsomewaydatareadfromonediskmayneedtobecombinedwithdatafrommultiplesources
- Allowingdatatobecombined frommultiple sourcesisnotoriously challenging
Hadoop…Let’sgo!l Hadoopgetsitsstartin“Nutch”,anopensourcewebsearchengineproject
l OnceGooglepublishedGFSandMAPREDUCEpapers,theroutebecameclear
MapReduceprovidesaprogrammingmodelthatabstractstheproblemfromdiskreadsandwrites,transformingitintoacomputationoversetsofkeysandvalues
MapReduceapproachmayseemlikeabrute-forceapproach,theentiredatasetisprocessedforeachquery.Itisabatchqueryprocessor
GLOBALHADOOP…
GLOBALHADOOP…
GLOBALHADOOP…
TraditionalRDBMSvsMapReduce
TraditionalRDBMSvsMapReduce
l RDBMSworkswellwithstructureddata
l MapReduceworkswellonunstructuredandonsemi-structureddatasinceitisdesignedtointepretthedataatprocessingtime
l MapReducecanbeseensuchasacomplementtoanRDBMS
MAPREDUCEMODELl TheprogrammingmodelusedinHadoopisMapReduce,proposedbyDeanandGhemawatatGoogle
l MapReduceisthebasicdataprocessingschemeusedinHadoopwhichincludesbreakingtheentiretaskintotwoparts,knownasMappersandReducers
l MapReducemodelconsistsoftwofunctions:amapfunctionandareducefunction.EachofthefunctionsdefinesamappingfromonesetofKEY-VALUEpairstoanother
MAPREDUCE– MODELl MappersreadthedatafromHDFS,processitandgeneratesomeintermediateresultstothereducers
l ReducersareusedtoaggregatetheintermediateresultstogeneratethefinaloutputwhichisagainwrittentoHDFS
l AtypicalHadoopjobinvolvesrunningseveralmappersandreducersacrossdifferentnodesinthecluster
MAPREDUCE- MODELl OneofthemajordrawbacksofMapReduceisitsinefficiencyinrunningiterativealgorithms.MapReduceisnotdesignedforiterativeprocesses
l Mappersreadthesamedataagainandagainfromthedisk.Hence,aftereachiteration,theresultshavetobewrittentothedisktopassthemontothenextiteration.Thismakesdiskaccessamajorbottleneckwhichsignificantlydegradestheperformance
AlternativetoApacheHadoop
l ApacheprojectcalledHaLoop extendsMapReducewithprogrammingsupportforiterativealgorithmsandimprovesefficiencybyaddingcachingmechanisms
l CGLMapReduce isanotherworkthatfocusesonimprovingtheperformanceofMapReduceiterativetasks.OtherexamplesofiterativeMapReduceincludeTwister andimapreduce
Spark– NextGenerationDataAnalysisParadigm
l Spark isanextgenerationparadigmforbigdataprocessingdevelopedbyresearchersatBerkleyUniversity
l ItisalternativetoHadoopwhichisdesignedtoovercomethediskI/Olimitations andimprovetheperformanceofearliersystems.ThemajorfeatureofSparkthatmakes ituniqueisitsabilitytoperformin-memorycomputations.Itallowsthedatatobecachedinmemory,thuseliminating theHadoop diskoverheadlimitationforiterativetasks
l Sparkisageneralenginelarge-scaledataprocessingthatsupportsJava,ScalaandPythonandforcertaintasksitistestedtobeupto100xfasterthanHadoopMapReduce
HADOOPSYSTEM
HADOOPComponents:l Common – AsetofcomponentandinterfacesfordistributedfilesystemsandgeneralI/O(serialization,JavaRPC,persistentdatastructures)
l Avro – Aserializationsystemforefficient,cross-languageRPC,persistentdata-storage
l MapReduce – Adistributeddataprocessingmodelandexecutionenvironmentthatrunsonlargeclustersofcommoditymachines
l HDFS – AdistributedFilesystemthatrunsonlargeclustersofcommoditymachines
HADOOPComponents:l Pig – ADataflowlanguageandexecutionenvironmentforexploringverylargedatasets.PigrunsonHDFSandMAPREDUCEclusters
l Hive – Distributeddatawarehouse.ItmanagesdatastoredinHDFSandprovidesaquerylanguagebasedonSQL,translatedtoMapReducejobsbyruntimeengine
l Hbase – DistributeColumn-orienteddatabase.Hbasesupportsbothbatch-stylecomputationsusingMAPREDUCEandpointqueries
l ZooKeeper – Distributedcoordinationservice. Itprovidesprimitivessuchasdistributed locksthatcanbeusedforbuildingdistributedapplications
l Sqoop – ToolformovingdatabetweenRelationalDBandHDFS
HadoopDistributedFileSystem
HDFS– keyfeatures
lHDFSishighlyfault-tolerant,withhighthroughput,suitableforapplicationswithlargedatasets,streamingaccesstofilesystemdataandcanbebuiltoutofcommodityhardware
HDFS– LargeDatasetl LargeDatasets – Processseveralsmalldatasetsranginginsomemegabytesorevenafewgigabytes
l - ThearchitectureofHDFSisdesignedinsuchawaythatitisbestfittostoreandretrievehugeamountofdata.Whatisrequiredishighcumulativedatabandwidthandthescalabilityfeaturetospreadoutfromasinglenodeclustertoahundredorathousand-nodecluster
HDFS– WriteOnceReadMany
l AfileinHDFSoncewrittenwillnotbemodified,thoughitcanbeaccessnnumberoftimes(thoughfutureversionsofHadoopmaysupportthisfeaturetoo)!Atpresent,inHDFSstrictlyhasonewriteratanytime
HDFS- StreamingDataAccess
l AsHDFSworksontheprincipleofWriteOnce,ReadMany,thefeatureofstreamingdataaccessisextremelyimportantinHDFS.AsHDFSisdesignedmoreforbatchprocessingratherthaninteractiveusebyusers
HDFS– CommodityHardware
lHDFS assumesthatthecluster(s)willrunoncommonhardware,thatis,non-expensive,ordinarymachinesratherthanhigh-availabilitysystems
HDFS– DatareplicationandFaulttolerance
HDFS– DatareplicationandFaulttolerance
l InHDFS,thefilesaredividedintolargeblocksofdataandeachblockisstoredonthreenodes:twoonthesamerackandoneonadifferentrackforfaulttolerance.Ablockistheamountofdatastoredoneverydatanode.Thoughthedefaultblocksizeis64MBandthereplicationfactoristhree,theseareconfigurableperfile
HDFS- HighThroughput
l InHadoopHDFS,whenwewanttoperformataskoranaction,thentheworkisdividedandsharedamongdifferentsystems.So,allthesystemswillbeexecutingthetasksassignedtothemindependentlyandinparallel.Sotheworkwillbecompletedinaveryshortperiodoftime
HDFSMovingComputationsisbetterthanMovingData
lHDFSworksontheprinciplethatifacomputationisdonebyanapplicationnearthedataitoperateson,itismuchmoreefficientthandonefarof,particularlywhentherearelargedatasets.Themajoradvantageisreductioninthenetworkcongestionandincreasedoverallthroughputofthesystem
HDFS–Filesystemnamespace
lHDFSnamespacehierarchyissimilartomostoftheotherexistingfilesystems,whereonecancreateanddeletefilesorrelocateafilefromonedirectorytoanother,orevenrenameafile(Hierarchicalfileorganization)
Hadoop– Alternativefilesysteml SomeHadoopusershavestrictdemandsaroundperformance,availabilityandenterprise-
gradefeatures,whileothersaren’tkeenofitsdirect-attachedstorage(DAS)architecture
l ThereisagrowingnumberofoptionsforreplacingHDFSl Cassandra (DataStax);l Ceph;l DispersedStorageNetwork(Cleversafe)l GPFS(IBM)l Isilon(EMC)l Lustrel MapRFileSysteml NetAppOpenSolutionforHadoop
HDFS– HadoopDistributedFileSystemMasterSlaveArchitecture
l HDFS hasamaster/slave architecture
l AnHDFSclusterconsistsofasingleNameNode,amasterserverthatmanagesthefilesystemnamespaceandregulatesaccesstofilesbyclients
l ThereareanumberofDataNodes,usuallyonepernode inthecluster,whichmanagestorageattachedtothenodesthattheyrunon
l HDFS exposesafilesystemnamespaceandallowsuserdatatobestoredinfiles.Internally,afileissplitintooneormoreblocksandtheseblocksarestoredinasetofDataNodes
HDFS– MasterSlaveArchitecturel TheNameNodeexecutesfilesystemnamespaceoperationslikeopening,closing,andrenamingfilesanddirectories
ItalsodeterminesthemappingofblockstoDataNodes.TheDataNodesareresponsibleforservingreadandwriterequestsfromthefilesystem’sclients
TheDataNodesalsoperformblockcreation,deletion,andreplicationuponinstructionfromtheNameNode
Hadoop– JobTracker&TaskTrackerl JobTracker istheservicewithinHadoopthatfarmsoutMapReducetasks tospecificnodesinthecluster,ideallythenodesthathavethedataor,atleast,areinthesamerack
l Clientapplications submitjobstotheJobTracker
l JobTrackertalkstotheNameNode todeterminethelocationofthedata
l JobTrackerlocatesTaskTracker nodeswithavailableslotsatornearthedataTheJobTrackersubmitstheworktothechosenTaskTrackernodes
l TaskTrackernodes arepermanentlymonitored.Iftheydonotsubmitheartbeatsignalsoftenenough,theyaredeemedtohavefailedandtheworkisscheduledonadifferentTaskTracker
Hadoop– JobTracker&Tasktrackerl ATaskTrackerwillnotifytheJobTrackerwhenataskfails.JobTrackerthendecideswhattodo:itmayresubmitthejobelsewhere,itmaymarkthatspecificrecordassomethingtoavoid,anditmayevenblacklisttheTaskTrackerasunreliable
l Whentheworkiscompleted,JobTrackerupdatesitsstatus.ClientapplicationscanpolltheJobTrackerforinformation.JobTrackerisapointoffailurefortheHadoopMapReduceservice.Ifitgoesdown,allrunningjobsarehalted
Hadoop– Node&Tracker
HadoopMapReduce- Scheme
ApacheHadoopl http://hadoop.apache.org/releases.html DownloadthereleasesofHadoop
l http://hadoop.apache.org/docs/current/ Docs
l http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html SingleNodeSetup
l http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html ClusterSetup
Hadoopl Standaloneà OneMachineà OneJvmProcess
l PseudoDistributedàOneMachineà severalJvmProcesses
l FullyDistributedà SeveralMachinesà JvmMachines
l SSHà Public&privatekeyMasterà Toallnodes
MapReduceARchitecturel JobClient
l Submit Jobs
l JobTrackerlCoordinatesThejobs
lScheduling theJobs
l TaskTrackerl Breaksdownthejobintwotasks:MAP,REDUCE
1. JobClientsubmitsajobtoaJobTracker2. JobTracker sendsaquerytoNameNode,NameNodeneedstoknowDataNodelocations
MAPREDUCE- INTERNALS
INPUTFORMAT
SPLIT SPLIT SPLIT
MAP MAP MAP
SHUFFLE
SORT
REDUCE
OUTPUT
DATANODE(1)INPUTFORMAT
SPLIT SPLIT SPLIT
MAP MAP MAP
SHUFFLE
SORT
REDUCE
OUTPUT
DATANODE(2)
MAPREDUCE- INTERNALS
SPLIT
Shuffle&Sort
REDUCE
MAP
Inputdataisdivided intoSplitsbasedontheinputformat.InputSplitsequatetoamaptaskrunninginparallel
Mapperstransforminput splitsintokey/valuepairsbasedonuserdefinedcode
MoveMappersoutput tothereducers,orderedbykeyvalues
Reducersaggregatekeyvaluesbasedonuserdefinedcode
OutputFormats
DefinehowresultswillbewrittenonHDFS
EXAMPLEofMAPREDUCEl Counttheoccurrencesofwordsinatext:
NUGGETSCISCOMSCISCONUGGETSWEBLINUXWINDOWSITNUGGETLINUXCERTSHADOOPAWSIT
INPUTSPLIT
NUGGETSCISCOMS
CISCONUGGETSWEB
LINUXWINDOWSIT
NUGGETSLINUXCERTS
HADOOPAWSIT
… MAP
EXAMPLEOFMAPREDUCE
SPLITNUGGETSCISCOMS
CISCONUGGETSWEB
LINUXWINDOWSIT
NUGGETSLINUXCERTS
HADOOPAWSIT
MAPNUGGETS(1)
CISCO(1)MS(1)
CISCO(1)NUGGETS(1)
WEB(1)
LINUX(1)WINDOWS(1)
IT(1)
NUGGETS(1)LINUX(1)CERTS(1)
HADOOP (1)AWS(1)
IT(1)
SHUFFLE/SORTNUGGETS(1)NUGGETS(1)NUGGETS(1)
CISCO(1)CISCO(1)
MS(1)
LINUX(1)LINUX(1)
WINDOWS(1)
WEB(1)
IT(1)IT(1)
CERTS(1)
AWS(1)
HADOOP(1)
... ...
EXAMPLEOFMAPREDUCESHUFFLE/SORT
NUGGETS(1)NUGGETS(1)NUGGETS(1)
CISCO(1)CISCO(1)
MS(1)
LINUX(1)LINUX(1)
WINDOWS(1)
WEB(1)
IT(1)IT(1)
CERTS(1)
AWS(1)
HADOOP(1)
...
REDUCECISCO(2)
MS(1)
WEB(1)
LINUX(2)
WINDOWS(1)
IT(2)
HADOOP (1)
AWS(1)
NUGGETS(3)
CERTS(1)
Theoutput isthepair(Key,Value)
Hadoop– SingleNode
Hadoop– SingleNode
Hadoop– PseudoDistributedOperation
Hadoop– PseudoDistributedOperations
Hadoop– PseudoDistributedOperations
BigDataAnalytics– CaseUse
l Bigdataanalyticsforsmartmobility
l Estimateofvisitorsmobilityflows
l Correlationbetweenmobilephonedataandspatialpatterns
l FinancialServices
l Education
BigDataAnalytics– CaseUsel Health
l Agriculture
l MarketCompetitiveness
l BehaviorPrediction
l TextAnalytics
l TelecommunicationIndustry
Clouderaprovidesascalable,flexible,integratedplatformthatmakesiteasytomanagerapidlyincreasingvolumesandvarietiesofdataClouderaproductsandsolutions enableyoutodeployandmanageApacheHadoopandrelatedprojects,manipulateandanalyzeyourdata,andkeepthatdatasecureandprotected.