big data with apache hadoop

Ing.AlessandroBruno– Ph.D.Research Fellow@CVIPGroup

PalermoUniversity

BIGDataAnalyticswithHadoop

ADI- FreeSoftware&ScientificResearch Session

Palermo24-26th June 2016

BigDataAnalytics

BigDataAnalytics

l“Inpioneerdaystheyusedoxenforheavypulling,andwhenoneoxcouldn’tbudgealog,theydidn’ttrytogrowalargerox.Weshouldbetryingforbiggercomputers,butformoresystemsofcomputers”.GraceHopper

BigDataAnalyticsl BigDataisdrivingradicalchangesintraditionaldataanalysisplatforms

l Performinganykindofanalysisinvoluminousandcomplexdata

l Scalingupthehardwareplatformsbecomesimminentandchoosingtherighthardware/softwareplatformsbecomesacrucialdecision

l Satisfyrequestinreasonableamountoftime

[link] [link] [link]

BigDataAnalytics

l Theabilityofbigdataplatformstoadpattoincreaseddataprocessingdemandsplaysacriticalrole:Isitappropriatetobuildtheanalyticsbasedsolutionsonaparticularplatform?

l Whatarethefundamentalissuesbeforemakingtherightdecisions?

FundamentalIssuesl Howquicklydoweneedtogettheresults?

l HowBigistheDatatobeprocessed?

l Doesthemodelbuildingrequireseveraliterationsor asingleiteration?

l Willtherebeaneedformoredataprocessingcapabilityinthefuture?

l Isdatatransfercriticalforthisapplication?

l Isthereanyneedforhandlinghardwarefailureswithintheapplications?

SomeexamplesofBigDatal 1.06billionmonthlyactiveusersonfacebookand680millionmobileusers

l Onanaverage,3.2billionlikesandcommentsarepostedeverydayonFacebook.

l 72%ofwebaudienceisonFacebook.

l FacebookstartedusingHadoopinmid-2009andwasoneoftheinitialusersofHadoop

l Facebookisgenerating500+terabytesofdataperday

SomeexamplesofBigDatal NYSE(NewYorkStockExchange)generatesabout1terabyteofnewtradedataperday,ajetairlinecollects10terabytes ofcensordataforevery 30minutes offlyingtime

l Googleprocessesover40,000searchquerieseverysecondonaverage,whichtranslatestoover3.5billionsearchesperdayand1.2trillionsearchesperyearworldwide

VforBIGDATA

Volume Variety Velocity Veracity Value

VforBigDatal Volume,Velocity,Variety,Veracity,Value

l Volumereferstothevastamountsofdatageneratedeverysecond.Justthinkofalltheemails,twittermessages,photos,videoclips,sensordataetc.weproduceandshareeverysecond

l Velocityreferstothespeedatwhichnewdataisgeneratedandthespeedatwhichdatamovesaround.Justthinkofsocialmediamessagesgoingviralinseconds,thespeedatwhichcreditcardtransactionsarecheckedforfraudulentactivities

VforBigDatal Varietyreferstothedifferenttypesofdatawecannowuse.Inthepastwefocusedonstructureddatathatneatlyfitsintotablesorrelationaldatabases,suchasfinancialdata(e.g.salesbyproductorregion).Infact,80%oftheworld’sdataisnowunstructured,andthereforecan’teasilybeputintotables(thinkofphotos,videosequencesorsocialmediaupdates)

l Veracityreferstothemessinessortrustworthinessofthedata.Withmanyformsofbigdata,qualityandaccuracyarelesscontrollable(justthinkofTwitterpostswithhashtags,abbreviations,typosandcolloquialspeechaswellasthereliabilityandaccuracyofcontent)

VforBigDatal Value: ThereisstillanotherVtotakeintoaccountwhenlookingatBigData:Value!Youarewellandgoodhavingaccesstobigdatabutunlesswecanturnitintovalueitisuseless.Soyoucansafelyarguethat'value'isthemostimportantVofBigData

Scalingl Scalingistheabilityofthesystemtoadapttoincreaseddemandsintermsofdataprocessing

l HORIZONTALSCALING(SCALEOUT) :Itinvolvesdistributing theworkloadacrossmanyserverswhichmaybecommoditymachines.Furthermore, multipleindependentmachinesareaddedtogetherinorder toimproveprocessingcapability

l VERTICALSCALING(SCALEUP):Itinvolvesinstallingmoreprocessors,morememoryandfasterhardware,generallywithinasingleserver

Scalingl DrawbacksandAdvantagesofHorizontalandVerticalScaling

HorizontalScalingl Scalingoutplatformsincludespeer-to-peernetworksandApacheHadoop

l Peertopeernetworksinvolvemillionsofmachinesconnectedinanetwork:decentralizedanddistributednetworkarchitecturewherethenodesinthenetworksserveasconsumeresources

l Bottleneckinpeertopeernetworksarisesinthecommunicationbetweendifferentnodes

l FeatureofMPI(MessagePassingInterface):statepreservingprocess;processcanliveaslongasthesystemrunsandthereisnoneedtoreadagainthesamedata(suchasinthecaseofMAPREDUCEscheme)

HorizontalScalingl InMPIarchitecturealltheparameterscanbepreservedlocally.MPIiswellsuitedforiterativeprocessing.ItisemployedbytheMaster/Slaveparadigm.SlavenodecanbecometheMasterforotherprocesses.

l DrawbacksofMPI:Thefaultintolerance,MPIhasnomechanismtohandlefaults…Asinglenodefailurecancausetheentiresystemshutdown

l WiththeadventofotherframeworkssuchasHadoop,MPIisnotbeingwidelyusedanymore…

HorizontalScaling- ApacheHadoopl ApacheHadoopisanopensourceframeworkforstoringandprocessinglargedatasetsusingclustersofcommodityhardware

l Scaleuptohundredsandeventhousandsofnodes

l Highlyfaulttolerant

HADOOPStorageandAnalysisat

InternetScale

Hadoop- DATAl Weliveinthedataage:

l NewYorkStockExchangegeneratesaboutoneterabyteofnewtradedataperday

l Facebookhostsapproximately10billionsphotos

l Ancestry.com(genealogysite)storedabout2.5petabytesofdata

l TheinternetArchivestoresaround2petabytesofdata…andisgrowingatarateof20terabytespermonth

THEGOODNEWSISTHATBIGDATAISHERE.THEBADNEWSISTHATWEARESTRUGGLINGTOSTOREANDANALYZEIT.

Hadoop– BIGDATAl ThefirstproblemrelativetoDataStorageandAnalysisistosolvehardwarefailureassoonasyoustartusingmanypiecesofhardware,thechancethatonewillfailisfairlyhigh…

l Thesecondproblemisthatmostanalysistasksneedtobeabletocombinethedatainsomewaydatareadfromonediskmayneedtobecombinedwithdatafrommultiplesources

- Allowingdatatobecombined frommultiple sourcesisnotoriously challenging

Hadoop…Let’sgo!l Hadoopgetsitsstartin“Nutch”,anopensourcewebsearchengineproject

l OnceGooglepublishedGFSandMAPREDUCEpapers,theroutebecameclear

MapReduceprovidesaprogrammingmodelthatabstractstheproblemfromdiskreadsandwrites,transformingitintoacomputationoversetsofkeysandvalues

MapReduceapproachmayseemlikeabrute-forceapproach,theentiredatasetisprocessedforeachquery.Itisabatchqueryprocessor

GLOBALHADOOP…

TraditionalRDBMSvsMapReduce

TraditionalRDBMSvsMapReduce

l RDBMSworkswellwithstructureddata

l MapReduceworkswellonunstructuredandonsemi-structureddatasinceitisdesignedtointepretthedataatprocessingtime

l MapReducecanbeseensuchasacomplementtoanRDBMS

MAPREDUCEMODELl TheprogrammingmodelusedinHadoopisMapReduce,proposedbyDeanandGhemawatatGoogle

l MapReduceisthebasicdataprocessingschemeusedinHadoopwhichincludesbreakingtheentiretaskintotwoparts,knownasMappersandReducers

l MapReducemodelconsistsoftwofunctions:amapfunctionandareducefunction.EachofthefunctionsdefinesamappingfromonesetofKEY-VALUEpairstoanother

MAPREDUCE– MODELl MappersreadthedatafromHDFS,processitandgeneratesomeintermediateresultstothereducers

l ReducersareusedtoaggregatetheintermediateresultstogeneratethefinaloutputwhichisagainwrittentoHDFS

l AtypicalHadoopjobinvolvesrunningseveralmappersandreducersacrossdifferentnodesinthecluster

MAPREDUCE- MODELl OneofthemajordrawbacksofMapReduceisitsinefficiencyinrunningiterativealgorithms.MapReduceisnotdesignedforiterativeprocesses

l Mappersreadthesamedataagainandagainfromthedisk.Hence,aftereachiteration,theresultshavetobewrittentothedisktopassthemontothenextiteration.Thismakesdiskaccessamajorbottleneckwhichsignificantlydegradestheperformance

AlternativetoApacheHadoop

l ApacheprojectcalledHaLoop extendsMapReducewithprogrammingsupportforiterativealgorithmsandimprovesefficiencybyaddingcachingmechanisms

l CGLMapReduce isanotherworkthatfocusesonimprovingtheperformanceofMapReduceiterativetasks.OtherexamplesofiterativeMapReduceincludeTwister andimapreduce

Spark– NextGenerationDataAnalysisParadigm

l Spark isanextgenerationparadigmforbigdataprocessingdevelopedbyresearchersatBerkleyUniversity

l ItisalternativetoHadoopwhichisdesignedtoovercomethediskI/Olimitations andimprovetheperformanceofearliersystems.ThemajorfeatureofSparkthatmakes ituniqueisitsabilitytoperformin-memorycomputations.Itallowsthedatatobecachedinmemory,thuseliminating theHadoop diskoverheadlimitationforiterativetasks

l Sparkisageneralenginelarge-scaledataprocessingthatsupportsJava,ScalaandPythonandforcertaintasksitistestedtobeupto100xfasterthanHadoopMapReduce

HADOOPSYSTEM

HADOOPComponents:l Common – AsetofcomponentandinterfacesfordistributedfilesystemsandgeneralI/O(serialization,JavaRPC,persistentdatastructures)

l Avro – Aserializationsystemforefficient,cross-languageRPC,persistentdata-storage

l MapReduce – Adistributeddataprocessingmodelandexecutionenvironmentthatrunsonlargeclustersofcommoditymachines

l HDFS – AdistributedFilesystemthatrunsonlargeclustersofcommoditymachines

HADOOPComponents:l Pig – ADataflowlanguageandexecutionenvironmentforexploringverylargedatasets.PigrunsonHDFSandMAPREDUCEclusters

l Hive – Distributeddatawarehouse.ItmanagesdatastoredinHDFSandprovidesaquerylanguagebasedonSQL,translatedtoMapReducejobsbyruntimeengine

l Hbase – DistributeColumn-orienteddatabase.Hbasesupportsbothbatch-stylecomputationsusingMAPREDUCEandpointqueries

l ZooKeeper – Distributedcoordinationservice. Itprovidesprimitivessuchasdistributed locksthatcanbeusedforbuildingdistributedapplications

l Sqoop – ToolformovingdatabetweenRelationalDBandHDFS

HadoopDistributedFileSystem

HDFS– keyfeatures

lHDFSishighlyfault-tolerant,withhighthroughput,suitableforapplicationswithlargedatasets,streamingaccesstofilesystemdataandcanbebuiltoutofcommodityhardware

HDFS– LargeDatasetl LargeDatasets – Processseveralsmalldatasetsranginginsomemegabytesorevenafewgigabytes

l - ThearchitectureofHDFSisdesignedinsuchawaythatitisbestfittostoreandretrievehugeamountofdata.Whatisrequiredishighcumulativedatabandwidthandthescalabilityfeaturetospreadoutfromasinglenodeclustertoahundredorathousand-nodecluster

HDFS– WriteOnceReadMany

l AfileinHDFSoncewrittenwillnotbemodified,thoughitcanbeaccessnnumberoftimes(thoughfutureversionsofHadoopmaysupportthisfeaturetoo)!Atpresent,inHDFSstrictlyhasonewriteratanytime

HDFS- StreamingDataAccess

l AsHDFSworksontheprincipleofWriteOnce,ReadMany,thefeatureofstreamingdataaccessisextremelyimportantinHDFS.AsHDFSisdesignedmoreforbatchprocessingratherthaninteractiveusebyusers

HDFS– CommodityHardware

lHDFS assumesthatthecluster(s)willrunoncommonhardware,thatis,non-expensive,ordinarymachinesratherthanhigh-availabilitysystems

HDFS– DatareplicationandFaulttolerance

HDFS– DatareplicationandFaulttolerance

l InHDFS,thefilesaredividedintolargeblocksofdataandeachblockisstoredonthreenodes:twoonthesamerackandoneonadifferentrackforfaulttolerance.Ablockistheamountofdatastoredoneverydatanode.Thoughthedefaultblocksizeis64MBandthereplicationfactoristhree,theseareconfigurableperfile

HDFS- HighThroughput

l InHadoopHDFS,whenwewanttoperformataskoranaction,thentheworkisdividedandsharedamongdifferentsystems.So,allthesystemswillbeexecutingthetasksassignedtothemindependentlyandinparallel.Sotheworkwillbecompletedinaveryshortperiodoftime

HDFSMovingComputationsisbetterthanMovingData

lHDFSworksontheprinciplethatifacomputationisdonebyanapplicationnearthedataitoperateson,itismuchmoreefficientthandonefarof,particularlywhentherearelargedatasets.Themajoradvantageisreductioninthenetworkcongestionandincreasedoverallthroughputofthesystem

HDFS–Filesystemnamespace

lHDFSnamespacehierarchyissimilartomostoftheotherexistingfilesystems,whereonecancreateanddeletefilesorrelocateafilefromonedirectorytoanother,orevenrenameafile(Hierarchicalfileorganization)

Hadoop– Alternativefilesysteml SomeHadoopusershavestrictdemandsaroundperformance,availabilityandenterprise-

gradefeatures,whileothersaren’tkeenofitsdirect-attachedstorage(DAS)architecture

l ThereisagrowingnumberofoptionsforreplacingHDFSl Cassandra (DataStax);l Ceph;l DispersedStorageNetwork(Cleversafe)l GPFS(IBM)l Isilon(EMC)l Lustrel MapRFileSysteml NetAppOpenSolutionforHadoop

HDFS– HadoopDistributedFileSystemMasterSlaveArchitecture

l HDFS hasamaster/slave architecture

l AnHDFSclusterconsistsofasingleNameNode,amasterserverthatmanagesthefilesystemnamespaceandregulatesaccesstofilesbyclients

l ThereareanumberofDataNodes,usuallyonepernode inthecluster,whichmanagestorageattachedtothenodesthattheyrunon

l HDFS exposesafilesystemnamespaceandallowsuserdatatobestoredinfiles.Internally,afileissplitintooneormoreblocksandtheseblocksarestoredinasetofDataNodes

HDFS– MasterSlaveArchitecturel TheNameNodeexecutesfilesystemnamespaceoperationslikeopening,closing,andrenamingfilesanddirectories

ItalsodeterminesthemappingofblockstoDataNodes.TheDataNodesareresponsibleforservingreadandwriterequestsfromthefilesystem’sclients

TheDataNodesalsoperformblockcreation,deletion,andreplicationuponinstructionfromtheNameNode

Hadoop– JobTracker&TaskTrackerl JobTracker istheservicewithinHadoopthatfarmsoutMapReducetasks tospecificnodesinthecluster,ideallythenodesthathavethedataor,atleast,areinthesamerack

l Clientapplications submitjobstotheJobTracker

l JobTrackertalkstotheNameNode todeterminethelocationofthedata

l JobTrackerlocatesTaskTracker nodeswithavailableslotsatornearthedataTheJobTrackersubmitstheworktothechosenTaskTrackernodes

l TaskTrackernodes arepermanentlymonitored.Iftheydonotsubmitheartbeatsignalsoftenenough,theyaredeemedtohavefailedandtheworkisscheduledonadifferentTaskTracker

Hadoop– JobTracker&Tasktrackerl ATaskTrackerwillnotifytheJobTrackerwhenataskfails.JobTrackerthendecideswhattodo:itmayresubmitthejobelsewhere,itmaymarkthatspecificrecordassomethingtoavoid,anditmayevenblacklisttheTaskTrackerasunreliable

l Whentheworkiscompleted,JobTrackerupdatesitsstatus.ClientapplicationscanpolltheJobTrackerforinformation.JobTrackerisapointoffailurefortheHadoopMapReduceservice.Ifitgoesdown,allrunningjobsarehalted

Hadoop– Node&Tracker

HadoopMapReduce- Scheme

ApacheHadoopl http://hadoop.apache.org/releases.html DownloadthereleasesofHadoop

l http://hadoop.apache.org/docs/current/ Docs

l http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html SingleNodeSetup

l http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html ClusterSetup

Hadoopl Standaloneà OneMachineà OneJvmProcess

l PseudoDistributedàOneMachineà severalJvmProcesses

l FullyDistributedà SeveralMachinesà JvmMachines

l SSHà Public&privatekeyMasterà Toallnodes

MapReduceARchitecturel JobClient

l Submit Jobs

l JobTrackerlCoordinatesThejobs

lScheduling theJobs

l TaskTrackerl Breaksdownthejobintwotasks:MAP,REDUCE

1. JobClientsubmitsajobtoaJobTracker2. JobTracker sendsaquerytoNameNode,NameNodeneedstoknowDataNodelocations

MAPREDUCE- INTERNALS

INPUTFORMAT

SPLIT SPLIT SPLIT

MAP MAP MAP

SHUFFLE

SORT

REDUCE

OUTPUT

DATANODE(1)INPUTFORMAT

SPLIT SPLIT SPLIT

MAP MAP MAP

SHUFFLE

SORT

REDUCE

OUTPUT

DATANODE(2)

MAPREDUCE- INTERNALS

SPLIT

Shuffle&Sort

REDUCE

MAP

Inputdataisdivided intoSplitsbasedontheinputformat.InputSplitsequatetoamaptaskrunninginparallel

Mapperstransforminput splitsintokey/valuepairsbasedonuserdefinedcode

MoveMappersoutput tothereducers,orderedbykeyvalues

Reducersaggregatekeyvaluesbasedonuserdefinedcode

OutputFormats

DefinehowresultswillbewrittenonHDFS

EXAMPLEofMAPREDUCEl Counttheoccurrencesofwordsinatext:

NUGGETSCISCOMSCISCONUGGETSWEBLINUXWINDOWSITNUGGETLINUXCERTSHADOOPAWSIT

INPUTSPLIT

NUGGETSCISCOMS

CISCONUGGETSWEB

LINUXWINDOWSIT

NUGGETSLINUXCERTS

HADOOPAWSIT

… MAP

EXAMPLEOFMAPREDUCE

SPLITNUGGETSCISCOMS

CISCONUGGETSWEB

LINUXWINDOWSIT

NUGGETSLINUXCERTS

HADOOPAWSIT

MAPNUGGETS(1)

CISCO(1)MS(1)

CISCO(1)NUGGETS(1)

WEB(1)

LINUX(1)WINDOWS(1)

IT(1)

NUGGETS(1)LINUX(1)CERTS(1)

HADOOP (1)AWS(1)

IT(1)

SHUFFLE/SORTNUGGETS(1)NUGGETS(1)NUGGETS(1)

CISCO(1)CISCO(1)

MS(1)

LINUX(1)LINUX(1)

WINDOWS(1)

WEB(1)

IT(1)IT(1)

CERTS(1)

AWS(1)

HADOOP(1)

... ...

EXAMPLEOFMAPREDUCESHUFFLE/SORT

NUGGETS(1)NUGGETS(1)NUGGETS(1)

CISCO(1)CISCO(1)

MS(1)

LINUX(1)LINUX(1)

WINDOWS(1)

WEB(1)

IT(1)IT(1)

CERTS(1)

AWS(1)

HADOOP(1)

...

REDUCECISCO(2)

MS(1)

WEB(1)

LINUX(2)

WINDOWS(1)

IT(2)

HADOOP (1)

AWS(1)

NUGGETS(3)

CERTS(1)

Theoutput isthepair(Key,Value)

Hadoop– SingleNode

Hadoop– PseudoDistributedOperation

Hadoop– PseudoDistributedOperations

BigDataAnalytics– CaseUse

l Bigdataanalyticsforsmartmobility

l Estimateofvisitorsmobilityflows

l Correlationbetweenmobilephonedataandspatialpatterns

l FinancialServices

l Education

BigDataAnalytics– CaseUsel Health

l Agriculture

l MarketCompetitiveness

l BehaviorPrediction

l TextAnalytics

l TelecommunicationIndustry

Clouderaprovidesascalable,flexible,integratedplatformthatmakesiteasytomanagerapidlyincreasingvolumesandvarietiesofdataClouderaproductsandsolutions enableyoutodeployandmanageApacheHadoopandrelatedprojects,manipulateandanalyzeyourdata,andkeepthatdatasecureandprotected.

big data with apache hadoop

Engineering