big data with apache hadoop

Ing.AlessandroBruno– Ph.D.Research Fellow@CVIPGroup

PalermoUniversity

BIGDataAnalyticswithHadoop

ADI- FreeSoftware&ScientificResearch Session

Palermo24-26th June 2016

BigDataAnalytics

l“Inpioneerdaystheyusedoxenforheavypulling,andwhenoneoxcouldn’tbudgealog,theydidn’ttrytogrowalargerox.Weshouldbetryingforbiggercomputers,butformoresystemsofcomputers”.GraceHopper

BigDataAnalyticsl BigDataisdrivingradicalchangesintraditionaldataanalysisplatforms

l Performinganykindofanalysisinvoluminousandcomplexdata

l Scalingupthehardwareplatformsbecomesimminentandchoosingtherighthardware/softwareplatformsbecomesacrucialdecision

l Satisfyrequestinreasonableamountoftime

[link] [link] [link]

BigDataAnalytics

l Theabilityofbigdataplatformstoadpattoincreaseddataprocessingdemandsplaysacriticalrole:Isitappropriatetobuildtheanalyticsbasedsolutionsonaparticularplatform?

l Whatarethefundamentalissuesbeforemakingtherightdecisions?

FundamentalIssuesl Howquicklydoweneedtogettheresults?

l HowBigistheDatatobeprocessed?

l Doesthemodelbuildingrequireseveraliterationsor asingleiteration?

l Willtherebeaneedformoredataprocessingcapabilityinthefuture?

l Isdatatransfercriticalforthisapplication?

l Isthereanyneedforhandlinghardwarefailureswithintheapplications?

SomeexamplesofBigDatal 1.06billionmonthlyactiveusersonfacebookand680millionmobileusers

l Onanaverage,3.2billionlikesandcommentsarepostedeverydayonFacebook.

l 72%ofwebaudienceisonFacebook.

l FacebookstartedusingHadoopinmid-2009andwasoneoftheinitialusersofHadoop

l Facebookisgenerating500+terabytesofdataperday

SomeexamplesofBigDatal NYSE(NewYorkStockExchange)generatesabout1terabyteofnewtradedataperday,ajetairlinecollects10terabytes ofcensordataforevery 30minutes offlyingtime

l Googleprocessesover40,000searchquerieseverysecondonaverage,whichtranslatestoover3.5billionsearchesperdayand1.2trillionsearchesperyearworldwide

VforBIGDATA

Volume Variety Velocity Veracity Value

VforBigDatal Volume,Velocity,Variety,Veracity,Value

l Volumereferstothevastamountsofdatageneratedeverysecond.Justthinkofalltheemails,twittermessages,photos,videoclips,sensordataetc.weproduceandshareeverysecond

l Velocityreferstothespeedatwhichnewdataisgeneratedandthespeedatwhichdatamovesaround.Justthinkofsocialmediamessagesgoingviralinseconds,thespeedatwhichcreditcardtransactionsarecheckedforfraudulentactivities

VforBigDatal Varietyreferstothedifferenttypesofdatawecannowuse.Inthepastwefocusedonstructureddatathatneatlyfitsintotablesorrelationaldatabases,suchasfinancialdata(e.g.salesbyproductorregion).Infact,80%oftheworld’sdataisnowunstructured,andthereforecan’teasilybeputintotables(thinkofphotos,videosequencesorsocialmediaupdates)

l Veracityreferstothemessinessortrustworthinessofthedata.Withmanyformsofbigdata,qualityandaccuracyarelesscontrollable(justthinkofTwitterpostswithhashtags,abbreviations,typosandcolloquialspeechaswellasthereliabilityandaccuracyofcontent)

VforBigDatal Value: ThereisstillanotherVtotakeintoaccountwhenlookingatBigData:Value!Youarewellandgoodhavingaccesstobigdatabutunlesswecanturnitintovalueitisuseless.Soyoucansafelyarguethat'value'isthemostimportantVofBigData

Scalingl Scalingistheabilityofthesystemtoadapttoincreaseddemandsintermsofdataprocessing

l HORIZONTALSCALING(SCALEOUT) :Itinvolvesdistributing theworkloadacrossmanyserverswhichmaybecommoditymachines.Furthermore, multipleindependentmachinesareaddedtogetherinorder toimproveprocessingcapability

l VERTICALSCALING(SCALEUP):Itinvolvesinstallingmoreprocessors,morememoryandfasterhardware,generallywithinasingleserver

Scalingl DrawbacksandAdvantagesofHorizontalandVerticalScaling

HorizontalScalingl Scalingoutplatformsincludespeer-to-peernetworksandApacheHadoop

l Peertopeernetworksinvolvemillionsofmachinesconnectedinanetwork:decentralizedanddistributednetworkarchitecturewherethenodesinthenetworksserveasconsumeresources

l Bottleneckinpeertopeernetworksarisesinthecommunicationbetweendifferentnodes

l FeatureofMPI(MessagePassingInterface):statepreservingprocess;processcanliveaslongasthesystemrunsandthereisnoneedtoreadagainthesamedata(suchasinthecaseofMAPREDUCEscheme)

HorizontalScalingl InMPIarchitecturealltheparameterscanbepreservedlocally.MPIiswellsuitedforiterativeprocessing.ItisemployedbytheMaster/Slaveparadigm.SlavenodecanbecometheMasterforotherprocesses.

l DrawbacksofMPI:Thefaultintolerance,MPIhasnomechanismtohandlefaults…Asinglenodefailurecancausetheentiresystemshutdown

l WiththeadventofotherframeworkssuchasHadoop,MPIisnotbeingwidelyusedanymore…

HorizontalScaling- ApacheHadoopl ApacheHadoopisanopensourceframeworkforstoringandprocessinglargedatasetsusingclustersofcommodityhardware

l Scaleuptohundredsandeventhousandsofnodes

l Highlyfaulttolerant

HADOOPStorageandAnalysisat

InternetScale

Hadoop- DATAl Weliveinthedataage:

l NewYorkStockExchangegeneratesaboutoneterabyteofnewtradedataperday

l Facebookhostsapproximately10billionsphotos

l Ancestry.com(genealogysite)storedabout2.5petabytesofdata

l TheinternetArchivestoresaround2petabytesofdata…andisgrowingatarateof20terabytespermonth

THEGOODNEWSISTHATBIGDATAISHERE.THEBADNEWSISTHATWEARESTRUGGLINGTOSTOREANDANALYZEIT.

Hadoop– BIGDATAl ThefirstproblemrelativetoDataStorageandAnalysisistosolvehardwarefailureassoonasyoustartusingmanypiecesofhardware,thechancethatonewillfailisfairlyhigh…

l Thesecondproblemisthatmostanalysistasksneedtobeabletocombinethedatainsomewaydatareadfromonediskmayneedtobecombinedwithdatafrommultiplesources

- Allowingdatatobecombined frommultiple sourcesisnotoriously challenging

Hadoop…Let’sgo!l Hadoopgetsitsstartin“Nutch”,anopensourcewebsearchengineproject

l OnceGooglepublishedGFSandMAPREDUCEpapers,theroutebecameclear

MapReduceprovidesaprogrammingmodelthatabstractstheproblemfromdiskreadsandwrites,transformingitintoacomputationoversetsofkeysandvalues

MapReduceapproachmayseemlikeabrute-forceapproach,theentiredatasetisprocessedforeachquery.Itisabatchqueryprocessor

GLOBALHADOOP…

TraditionalRDBMSvsMapReduce

l RDBMSworkswellwithstructureddata

l MapReduceworkswellonunstructuredandonsemi-structureddatasinceitisdesignedtointepretthedataatprocessingtime

l MapReducecanbeseensuchasacomplementtoanRDBMS

MAPREDUCEMODELl TheprogrammingmodelusedinHadoopisMapReduce,proposedbyDeanandGhemawatatGoogle

l MapReduceisthebasicdataprocessingschemeusedinHadoopwhichincludesbreakingtheentiretaskintotwoparts,knownasMappersandReducers

l MapReducemodelconsistsoftwofunctions:amapfunctionandareducefunction.EachofthefunctionsdefinesamappingfromonesetofKEY-VALUEpairstoanother

MAPREDUCE– MODELl MappersreadthedatafromHDFS,processitandgeneratesomeintermediateresultstothereducers

l ReducersareusedtoaggregatetheintermediateresultstogeneratethefinaloutputwhichisagainwrittentoHDFS

l AtypicalHadoopjobinvolvesrunningseveralmappersandreducersacrossdifferentnodesinthecluster

MAPREDUCE- MODELl OneofthemajordrawbacksofMapReduceisitsinefficiencyinrunningiterativealgorithms.MapReduceisnotdesignedforiterativeprocesses

l Mappersreadthesamedataagainandagainfromthedisk.Hence,aftereachiteration,theresultshavetobewrittentothedisktopassthemontothenextiteration.Thismakesdiskaccessamajorbottleneckwhichsignificantlydegradestheperformance

AlternativetoApacheHadoop

l ApacheprojectcalledHaLoop extendsMapReducewithprogrammingsupportforiterativealgorithmsandimprovesefficiencybyaddingcachingmechanisms

l CGLMapReduce isanotherworkthatfocusesonimprovingtheperformanceofMapReduceiterativetasks.OtherexamplesofiterativeMapReduceincludeTwister andimapreduce

Spark– NextGenerationDataAnalysisParadigm

l Spark isanextgenerationparadigmforbigdataprocessingdevelopedbyresearchersatBerkleyUniversity

l ItisalternativetoHadoopwhichisdesignedtoovercomethediskI/Olimitations andimprovetheperformanceofearliersystems.ThemajorfeatureofSparkthatmakes ituniqueisitsabilitytoperformin-memorycomputations.Itallowsthedatatobecachedinmemory,thuseliminating theHadoop diskoverheadlimitationforiterativetasks

l Sparkisageneralenginelarge-scaledataprocessingthatsupportsJava,ScalaandPythonandforcertaintasksitistestedtobeupto100xfasterthanHadoopMapReduce

HADOOPSYSTEM

HADOOPComponents:l Common – AsetofcomponentandinterfacesfordistributedfilesystemsandgeneralI/O(serialization,JavaRPC,persistentdatastructures)

l Avro – Aserializationsystemforefficient,cross-languageRPC,persistentdata-storage

l MapReduce – Adistributeddataprocessingmodelandexecutionenvironmentthatrunsonlargeclustersofcommoditymachines

l HDFS – AdistributedFilesystemthatrunsonlargeclustersofcommoditymachines

HADOOPComponents:l Pig – ADataflowlanguageandexecutionenvironmentforexploringverylargedatasets.PigrunsonHDFSandMAPREDUCEclusters

l Hive – Distributeddatawarehouse.ItmanagesdatastoredinHDFSandprovidesaquerylanguagebasedonSQL,translatedtoMapReducejobsbyruntimeengine

l Hbase – DistributeColumn-orienteddatabase.Hbasesupportsbothbatch-stylecomputationsusingMAPREDUCEandpointqueries

l ZooKeeper – Distributedcoordinationservice. Itprovidesprimitivessuchasdistributed locksthatcanbeusedforbuildingdistributedapplications

l Sqoop – ToolformovingdatabetweenRelationalDBandHDFS

HadoopDistributedFileSystem

HDFS– keyfeatures

lHDFSishighlyfault-tolerant,withhighthroughput,suitableforapplicationswithlargedatasets,streamingaccesstofilesystemdataandcanbebuiltoutofcommodityhardware

HDFS– LargeDatasetl LargeDatasets – Processseveralsmalldatasetsranginginsomemegabytesorevenafewgigabytes

l - ThearchitectureofHDFSisdesignedinsuchawaythatitisbestfittostoreandretrievehugeamountofdata.Whatisrequiredishighcumulativedatabandwidthandthescalabilityfeaturetospreadoutfromasinglenodeclustertoahundredorathousand-nodecluster

HDFS– WriteOnceReadMany

l AfileinHDFSoncewrittenwillnotbemodified,thoughitcanbeaccessnnumberoftimes(thoughfutureversionsofHadoopmaysupportthisfeaturetoo)!Atpresent,inHDFSstrictlyhasonewriteratanytime

HDFS- StreamingDataAccess

l AsHDFSworksontheprincipleofWriteOnce,ReadMany,thefeatureofstreamingdataaccessisextremelyimportantinHDFS.AsHDFSisdesignedmoreforbatchprocessingratherthaninteractiveusebyusers

HDFS– CommodityHardware

lHDFS assumesthatthecluster(s)willrunoncommonhardware,thatis,non-expensive,ordinarymachinesratherthanhigh-availabilitysystems

HDFS– DatareplicationandFaulttolerance

l InHDFS,thefilesaredividedintolargeblocksofdataandeachblockisstoredonthreenodes:twoonthesamerackandoneonadifferentrackforfaulttolerance.Ablockistheamountofdatastoredoneverydatanode.Thoughthedefaultblocksizeis64MBandthereplicationfactoristhree,theseareconfigurableperfile

HDFS- HighThroughput

l InHadoopHDFS,whenwewanttoperformataskoranaction,thentheworkisdividedandsharedamongdifferentsystems.So,allthesystemswillbeexecutingthetasksassignedtothemindependentlyandinparallel.Sotheworkwillbecompletedinaveryshortperiodoftime

HDFSMovingComputationsisbetterthanMovingData

lHDFSworksontheprinciplethatifacomputationisdonebyanapplicationnearthedataitoperateson,itismuchmoreefficientthandonefarof,particularlywhentherearelargedatasets.Themajoradvantageisreductioninthenetworkcongestionandincreasedoverallthroughputofthesystem

HDFS–Filesystemnamespace

lHDFSnamespacehierarchyissimilartomostoftheotherexistingfilesystems,whereonecancreateanddeletefilesorrelocateafilefromonedirectorytoanother,orevenrenameafile(Hierarchicalfileorganization)

Hadoop– Alternativefilesysteml SomeHadoopusershavestrictdemandsaroundperformance,availabilityandenterprise-

gradefeatures,whileothersaren’tkeenofitsdirect-attachedstorage(DAS)architecture

l ThereisagrowingnumberofoptionsforreplacingHDFSl Cassandra (DataStax);l Ceph;l DispersedStorageNetwork(Cleversafe)l GPFS(IBM)l Isilon(EMC)l Lustrel MapRFileSysteml NetAppOpenSolutionforHadoop

HDFS– HadoopDistributedFileSystemMasterSlaveArchitecture

l HDFS hasamaster/slave architecture

l AnHDFSclusterconsistsofasingleNameNode,amasterserverthatmanagesthefilesystemnamespaceandregulatesaccesstofilesbyclients

l ThereareanumberofDataNodes,usuallyonepernode inthecluster,whichmanagestorageattachedtothenodesthattheyrunon

l HDFS exposesafilesystemnamespaceandallowsuserdatatobestoredinfiles.Internally,afileissplitintooneormoreblocksandtheseblocksarestoredinasetofDataNodes

HDFS– MasterSlaveArchitecturel TheNameNodeexecutesfilesystemnamespaceoperationslikeopening,closing,andrenamingfilesanddirectories

ItalsodeterminesthemappingofblockstoDataNodes.TheDataNodesareresponsibleforservingreadandwriterequestsfromthefilesystem’sclients

TheDataNodesalsoperformblockcreation,deletion,andreplicationuponinstructionfromtheNameNode

Hadoop– JobTracker&TaskTrackerl JobTracker istheservicewithinHadoopthatfarmsoutMapReducetasks tospecificnodesinthecluster,ideallythenodesthathavethedataor,atleast,areinthesamerack

l Clientapplications submitjobstotheJobTracker

l JobTrackertalkstotheNameNode todeterminethelocationofthedata

l JobTrackerlocatesTaskTracker nodeswithavailableslotsatornearthedataTheJobTrackersubmitstheworktothechosenTaskTrackernodes

l TaskTrackernodes arepermanentlymonitored.Iftheydonotsubmitheartbeatsignalsoftenenough,theyaredeemedtohavefailedandtheworkisscheduledonadifferentTaskTracker

Hadoop– JobTracker&Tasktrackerl ATaskTrackerwillnotifytheJobTrackerwhenataskfails.JobTrackerthendecideswhattodo:itmayresubmitthejobelsewhere,itmaymarkthatspecificrecordassomethingtoavoid,anditmayevenblacklisttheTaskTrackerasunreliable

l Whentheworkiscompleted,JobTrackerupdatesitsstatus.ClientapplicationscanpolltheJobTrackerforinformation.JobTrackerisapointoffailurefortheHadoopMapReduceservice.Ifitgoesdown,allrunningjobsarehalted

Hadoop– Node&Tracker

HadoopMapReduce- Scheme

ApacheHadoopl http://hadoop.apache.org/releases.html DownloadthereleasesofHadoop

l http://hadoop.apache.org/docs/current/ Docs

l http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html SingleNodeSetup

l http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html ClusterSetup

Hadoopl Standaloneà OneMachineà OneJvmProcess

l PseudoDistributedàOneMachineà severalJvmProcesses

l FullyDistributedà SeveralMachinesà JvmMachines

l SSHà Public&privatekeyMasterà Toallnodes

MapReduceARchitecturel JobClient

l Submit Jobs

l JobTrackerlCoordinatesThejobs

lScheduling theJobs

l TaskTrackerl Breaksdownthejobintwotasks:MAP,REDUCE

1. JobClientsubmitsajobtoaJobTracker2. JobTracker sendsaquerytoNameNode,NameNodeneedstoknowDataNodelocations

MAPREDUCE- INTERNALS

INPUTFORMAT

SPLIT SPLIT SPLIT

MAP MAP MAP

SHUFFLE

REDUCE

OUTPUT

DATANODE(1)INPUTFORMAT

SPLIT SPLIT SPLIT

MAP MAP MAP

SHUFFLE

REDUCE

OUTPUT

DATANODE(2)

MAPREDUCE- INTERNALS

Shuffle&Sort

REDUCE

Inputdataisdivided intoSplitsbasedontheinputformat.InputSplitsequatetoamaptaskrunninginparallel

Mapperstransforminput splitsintokey/valuepairsbasedonuserdefinedcode

MoveMappersoutput tothereducers,orderedbykeyvalues

Reducersaggregatekeyvaluesbasedonuserdefinedcode

OutputFormats

DefinehowresultswillbewrittenonHDFS

EXAMPLEofMAPREDUCEl Counttheoccurrencesofwordsinatext:

NUGGETSCISCOMSCISCONUGGETSWEBLINUXWINDOWSITNUGGETLINUXCERTSHADOOPAWSIT

INPUTSPLIT

NUGGETSCISCOMS

CISCONUGGETSWEB

LINUXWINDOWSIT

NUGGETSLINUXCERTS

HADOOPAWSIT

… MAP

EXAMPLEOFMAPREDUCE

SPLITNUGGETSCISCOMS

CISCONUGGETSWEB

LINUXWINDOWSIT

NUGGETSLINUXCERTS

HADOOPAWSIT

MAPNUGGETS(1)

CISCO(1)MS(1)

CISCO(1)NUGGETS(1)

WEB(1)

LINUX(1)WINDOWS(1)

NUGGETS(1)LINUX(1)CERTS(1)

HADOOP (1)AWS(1)

SHUFFLE/SORTNUGGETS(1)NUGGETS(1)NUGGETS(1)

CISCO(1)CISCO(1)

LINUX(1)LINUX(1)

WINDOWS(1)

WEB(1)

IT(1)IT(1)

CERTS(1)

AWS(1)

HADOOP(1)

... ...

EXAMPLEOFMAPREDUCESHUFFLE/SORT

NUGGETS(1)NUGGETS(1)NUGGETS(1)

CISCO(1)CISCO(1)

LINUX(1)LINUX(1)

WINDOWS(1)

WEB(1)

IT(1)IT(1)

CERTS(1)

AWS(1)

HADOOP(1)

REDUCECISCO(2)

WEB(1)

LINUX(2)

WINDOWS(1)

HADOOP (1)

AWS(1)

NUGGETS(3)

CERTS(1)

Theoutput isthepair(Key,Value)

Hadoop– SingleNode

Hadoop– PseudoDistributedOperation

Hadoop– PseudoDistributedOperations

BigDataAnalytics– CaseUse

l Bigdataanalyticsforsmartmobility

l Estimateofvisitorsmobilityflows

l Correlationbetweenmobilephonedataandspatialpatterns

l FinancialServices

l Education

BigDataAnalytics– CaseUsel Health

l Agriculture

l MarketCompetitiveness

l BehaviorPrediction

l TextAnalytics

l TelecommunicationIndustry

Clouderaprovidesascalable,flexible,integratedplatformthatmakesiteasytomanagerapidlyincreasingvolumesandvarietiesofdataClouderaproductsandsolutions enableyoutodeployandmanageApacheHadoopandrelatedprojects,manipulateandanalyzeyourdata,andkeepthatdatasecureandprotected.

big data with apache hadoop

Engineering

apache calcite for enabling sql access to nosql data ... ·...

applying apache hadoop to nasa's big climate data use cases

introduction to big data analytics on apache hadoop

big neuroscience infrastructure · 2017. 3. 1. · • rdma...

apache hadoop today & tomorrow - snia€¦ · apache hadoop...

big data - hadoop/mapreduce - github pages · enter .....

accelerating big data processing with hadoop, spark and...

introduction to apache...

cw13 big data and apache hadoop by amr awadallah-cloudera

performance-analyse von apache spark und apache...

big data: using arcgis with apache hadoop · big data:...

big data camp la 2014 - apache tajo: a big data warehouse...

big data: concepts, techniques et démonstration de apache...

apache hadoop ecosystem - lias (lab · apache hadoop...

apache hadoop big data technology

apache hadoop ingestion patterns & apache flume

apache spark & hadoop

big data: tecnologie, metodologie e applicazioni per l...

apache metron and apache spot big data tools for...

high performance big data (hibd): accelerating hadoop...