
Page 1: Big data with apache hadoop

Ing. Alessandro Bruno – Ph.D. Research Fellow @ CVIP Group

Palermo University

BIG Data Analytics with Hadoop

ADI - Free Software & Scientific Research Session

Palermo, 24-26th June 2016

Page 2: Big data with apache hadoop

Big Data Analytics

Page 3: Big data with apache hadoop

Big Data Analytics

- "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." – Grace Hopper

Page 4: Big data with apache hadoop

Big Data Analytics

- Big Data is driving radical changes in traditional data analysis platforms

- Performing any kind of analysis on voluminous and complex data

- Scaling up the hardware platforms becomes imminent, and choosing the right hardware/software platform becomes a crucial decision

- Satisfying requests in a reasonable amount of time


Page 5: Big data with apache hadoop

Big Data Analytics

- The ability of big data platforms to adapt to increased data processing demands plays a critical role: is it appropriate to build analytics-based solutions on a particular platform?

- What are the fundamental issues to consider before making the right decision?

Page 6: Big data with apache hadoop

Fundamental Issues

- How quickly do we need to get the results?

- How big is the data to be processed?

- Does the model building require several iterations or a single iteration?

- Will there be a need for more data processing capability in the future?

- Is data transfer critical for this application?

- Is there any need for handling hardware failures within the applications?

Page 7: Big data with apache hadoop

Some examples of Big Data

- 1.06 billion monthly active users on Facebook and 680 million mobile users

- On average, 3.2 billion likes and comments are posted every day on Facebook.

- 72% of the web audience is on Facebook.

- Facebook started using Hadoop in mid-2009 and was one of its initial users

- Facebook is generating 500+ terabytes of data per day

Page 8: Big data with apache hadoop

Some examples of Big Data

- NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day; a jet airliner collects 10 terabytes of sensor data for every 30 minutes of flying time

- Google processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide

Page 9: Big data with apache hadoop

V for BIG DATA

Volume Variety Velocity Veracity Value

Page 10: Big data with apache hadoop

V for Big Data

- Volume, Velocity, Variety, Veracity, Value

- Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips, sensor data, etc. we produce and share every second

- Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds, or the speed at which credit card transactions are checked for fraudulent activities

Page 11: Big data with apache hadoop

V for Big Data

- Variety refers to the different types of data we can now use. In the past we focused on structured data that neatly fits into tables or relational databases, such as financial data (e.g. sales by product or region). In fact, 80% of the world's data is now unstructured and therefore can't easily be put into tables (think of photos, video sequences or social media updates)

- Veracity refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of content)

Page 12: Big data with apache hadoop

V for Big Data

- Value: there is still another V to take into account when looking at Big Data: Value! It is all well and good having access to big data, but unless we can turn it into value it is useless. So you can safely argue that 'value' is the most important V of Big Data

Page 13: Big data with apache hadoop

Scaling

- Scaling is the ability of the system to adapt to increased demands in terms of data processing

- HORIZONTAL SCALING (SCALE OUT): it involves distributing the workload across many servers, which may be commodity machines. Furthermore, multiple independent machines are added together in order to improve processing capability

- VERTICAL SCALING (SCALE UP): it involves installing more processors, more memory and faster hardware, generally within a single server

Page 14: Big data with apache hadoop

Scaling

- Drawbacks and advantages of horizontal and vertical scaling

Page 15: Big data with apache hadoop

Horizontal Scaling

- Scaling-out platforms include peer-to-peer networks and Apache Hadoop

- Peer-to-peer networks involve millions of machines connected in a network: a decentralized and distributed network architecture in which the nodes both serve and consume resources

- The bottleneck in peer-to-peer networks arises in the communication between the different nodes

- A feature of MPI (Message Passing Interface) is its state-preserving processes: a process can live as long as the system runs, and there is no need to read the same data again (as in the MapReduce scheme)

Page 16: Big data with apache hadoop

Horizontal Scaling

- In the MPI architecture all the parameters can be preserved locally. MPI is well suited for iterative processing. It employs the Master/Slave paradigm, and a slave node can become the master for other processes.

- Drawback of MPI: fault intolerance. MPI has no mechanism to handle faults; a single node failure can cause the entire system to shut down

- With the advent of other frameworks such as Hadoop, MPI is no longer widely used

Page 17: Big data with apache hadoop

Horizontal Scaling - Apache Hadoop

- Apache Hadoop is an open source framework for storing and processing large datasets using clusters of commodity hardware

- Scales up to hundreds and even thousands of nodes

- Highly fault tolerant

Page 18: Big data with apache hadoop

HADOOP: Storage and Analysis at Internet Scale

Page 19: Big data with apache hadoop

Hadoop - DATA

- We live in the data age:

- The New York Stock Exchange generates about one terabyte of new trade data per day

- Facebook hosts approximately 10 billion photos

- Ancestry.com (genealogy site) stores about 2.5 petabytes of data

- The Internet Archive stores around 2 petabytes of data… and is growing at a rate of 20 terabytes per month

THE GOOD NEWS IS THAT BIG DATA IS HERE. THE BAD NEWS IS THAT WE ARE STRUGGLING TO STORE AND ANALYZE IT.

Page 20: Big data with apache hadoop

Hadoop – BIG DATA

- The first problem for data storage and analysis is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high…

- The second problem is that most analysis tasks need to be able to combine the data in some way: data read from one disk may need to be combined with data from multiple sources

- Allowing data to be combined from multiple sources is notoriously challenging

Page 21: Big data with apache hadoop

Hadoop… Let's go!

- Hadoop got its start in "Nutch", an open source web search engine project

- Once Google published the GFS and MapReduce papers, the route became clear

- MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values

- The MapReduce approach may seem like a brute-force approach: the entire dataset is processed for each query. It is a batch query processor

Page 22: Big data with apache hadoop

GLOBAL HADOOP…

Page 23: Big data with apache hadoop

GLOBAL HADOOP…

Page 24: Big data with apache hadoop

GLOBAL HADOOP…

Page 25: Big data with apache hadoop

Traditional RDBMS vs MapReduce

Page 26: Big data with apache hadoop

Traditional RDBMS vs MapReduce

- RDBMS works well with structured data

- MapReduce works well on unstructured and semi-structured data, since it is designed to interpret the data at processing time

- MapReduce can be seen as a complement to an RDBMS

Page 27: Big data with apache hadoop

MAPREDUCE MODEL

- The programming model used in Hadoop is MapReduce, proposed by Dean and Ghemawat at Google

- MapReduce is the basic data processing scheme used in Hadoop, which includes breaking the entire task into two parts, known as mappers and reducers

- The MapReduce model consists of two functions: a map function and a reduce function. Each of the functions defines a mapping from one set of KEY-VALUE pairs to another (see the sketch below)
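A minimal sketch (not from the slides) of how this looks in Hadoop's Java API: user code subclasses two generic classes whose type parameters spell out the key-value mapping. The class names MyMapper and MyReducer are only illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    //   map:    (k1, v1)       -> list of (k2, v2)
    //   reduce: (k2, list(v2)) -> list of (k3, v3)
    // Here k1 = LongWritable (byte offset in the file), v1 = Text (one line of input),
    // k2 = Text (e.g. a word), v2 = IntWritable (e.g. a count).
    class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { /* override map() */ }
    class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { /* override reduce() */ }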

Page 28: Big data with apache hadoop

MAPREDUCE – MODEL

- Mappers read the data from HDFS, process it and pass some intermediate results to the reducers

- Reducers are used to aggregate the intermediate results to generate the final output, which is again written to HDFS

- A typical Hadoop job involves running several mappers and reducers across different nodes in the cluster

Page 29: Big data with apache hadoop

MAPREDUCE - MODEL

- One of the major drawbacks of MapReduce is its inefficiency in running iterative algorithms. MapReduce is not designed for iterative processes

- Mappers read the same data again and again from the disk. Hence, after each iteration, the results have to be written to the disk to pass them on to the next iteration. This makes disk access a major bottleneck, which significantly degrades the performance

Page 30: Big data with apache hadoop

Alternatives to Apache Hadoop

- A project called HaLoop extends MapReduce with programming support for iterative algorithms and improves efficiency by adding caching mechanisms

- CGL-MapReduce is another work that focuses on improving the performance of iterative MapReduce tasks. Other examples of iterative MapReduce include Twister and iMapReduce

Page 31: Big data with apache hadoop

Spark – Next Generation Data Analysis Paradigm

- Spark is a next-generation paradigm for big data processing developed by researchers at the University of California, Berkeley

- It is an alternative to Hadoop, designed to overcome the disk I/O limitations and improve the performance of earlier systems. The major feature of Spark that makes it unique is its ability to perform in-memory computations. It allows the data to be cached in memory, thus eliminating the Hadoop disk overhead limitation for iterative tasks

- Spark is a general engine for large-scale data processing that supports Java, Scala and Python, and for certain tasks it is tested to be up to 100x faster than Hadoop MapReduce

Page 32: Big data with apache hadoop

HADOOP SYSTEM

Page 33: Big data with apache hadoop

HADOOP Components:

- Common – A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures)

- Avro – A serialization system for efficient, cross-language RPC and persistent data storage

- MapReduce – A distributed data processing model and execution environment that runs on large clusters of commodity machines

- HDFS – A distributed filesystem that runs on large clusters of commodity machines

Page 34: Big data with apache hadoop

HADOOP Components:

- Pig – A dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters

- Hive – A distributed data warehouse. It manages data stored in HDFS and provides a query language based on SQL, translated to MapReduce jobs by the runtime engine

- HBase – A distributed, column-oriented database. HBase supports both batch-style computations using MapReduce and point queries

- ZooKeeper – A distributed coordination service. It provides primitives such as distributed locks that can be used for building distributed applications

- Sqoop – A tool for moving data between relational databases and HDFS

Page 35: Big data with apache hadoop

Hadoop Distributed File System

Page 36: Big data with apache hadoop

HDFS – key features

- HDFS is highly fault-tolerant, has high throughput, is suitable for applications with large datasets and streaming access to file system data, and can be built out of commodity hardware

Page 37: Big data with apache hadoop

HDFS – Large Datasets

- Large Datasets – HDFS is meant for processing large datasets, rather than several small datasets of some megabytes or even a few gigabytes

- The architecture of HDFS is designed in such a way that it is the best fit to store and retrieve huge amounts of data. What is required is high cumulative data bandwidth and the scalability to spread out from a single-node cluster to a hundred- or thousand-node cluster

Page 38: Big data with apache hadoop

HDFS – Write Once, Read Many

- A file in HDFS, once written, will not be modified, though it can be accessed any number of times (future versions of Hadoop may support file modification too)! At present, HDFS strictly has one writer at any time

Page 39: Big data with apache hadoop

HDFS - Streaming Data Access

- As HDFS works on the principle of Write Once, Read Many, the feature of streaming data access is extremely important in HDFS, since HDFS is designed for batch processing rather than interactive use by users

Page 40: Big data with apache hadoop

HDFS – Commodity Hardware

- HDFS assumes that the cluster(s) will run on common hardware, that is, inexpensive, ordinary machines rather than high-availability systems

Page 41: Big data with apache hadoop

HDFS – Data Replication and Fault Tolerance

Page 42: Big data with apache hadoop

HDFS – Data Replication and Fault Tolerance

- In HDFS, the files are divided into large blocks of data and each block is stored on three nodes: two on the same rack and one on a different rack for fault tolerance. A block is the amount of data stored on every DataNode. Though the default block size is 64 MB and the replication factor is three, these are configurable per file (see the sketch below)
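A minimal sketch (not from the slides) of how a client can override the replication factor and block size for a single file through Hadoop's Java FileSystem API; the path and the chosen values are only illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // create() lets the caller choose replication and block size per file,
            // overriding the cluster defaults (e.g. 3 replicas, 64 MB blocks on older releases).
            Path file = new Path("/user/demo/big-file.dat");     // illustrative path
            short replication = 2;                               // two copies instead of the default three
            long blockSize = 128L * 1024 * 1024;                 // 128 MB blocks
            try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
                out.writeUTF("example payload");
            }
        }
    }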

Page 43: Big data with apache hadoop

HDFS - High Throughput

- In Hadoop HDFS, when we want to perform a task or an action, the work is divided and shared among different systems. All the systems execute the tasks assigned to them independently and in parallel, so the work is completed in a very short period of time

Page 44: Big data with apache hadoop

HDFS – Moving Computation is Better than Moving Data

- HDFS works on the principle that a computation performed by an application near the data it operates on is much more efficient than one performed far off, particularly when there are large datasets. The major advantage is a reduction in network congestion and increased overall throughput of the system

Page 45: Big data with apache hadoop

HDFS – File system namespace

- The HDFS namespace hierarchy is similar to most of the other existing file systems, where one can create and delete files, relocate a file from one directory to another, or even rename a file (hierarchical file organization)
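A minimal sketch (not from the slides) of these namespace operations through the Java FileSystem API; all paths are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            fs.mkdirs(new Path("/user/demo/reports"));              // create a directory
            fs.rename(new Path("/user/demo/draft.txt"),             // relocate / rename a file
                      new Path("/user/demo/reports/final.txt"));
            fs.delete(new Path("/user/demo/obsolete"), true);       // delete a directory recursively
        }
    }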

Page 46: Big data with apache hadoop

Hadoop – Alternative file systems

- Some Hadoop users have strict demands around performance, availability and enterprise-grade features, while others aren't keen on its direct-attached storage (DAS) architecture

- There is a growing number of options for replacing HDFS:
  - Cassandra (DataStax)
  - Ceph
  - Dispersed Storage Network (Cleversafe)
  - GPFS (IBM)
  - Isilon (EMC)
  - Lustre
  - MapR File System
  - NetApp Open Solution for Hadoop

Page 47: Big data with apache hadoop

HDFS – Hadoop Distributed File System: Master/Slave Architecture

- HDFS has a master/slave architecture

- An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients

- There are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on

- HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes

Page 48: Big data with apache hadoop

HDFS – Master/Slave Architecture

- The NameNode executes file system namespace operations like opening, closing, and renaming files and directories

- It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients

- The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode

Page 49: Big data with apache hadoop

Hadoop – JobTracker & TaskTracker

- The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data or, at least, are in the same rack

- Client applications submit jobs to the JobTracker

- The JobTracker talks to the NameNode to determine the location of the data

- The JobTracker locates TaskTracker nodes with available slots at or near the data. The JobTracker submits the work to the chosen TaskTracker nodes

- TaskTracker nodes are permanently monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker

Page 50: Big data with apache hadoop

Hadoop – JobTracker & TaskTracker

- A TaskTracker will notify the JobTracker when a task fails. The JobTracker then decides what to do: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable

- When the work is completed, the JobTracker updates its status. Client applications can poll the JobTracker for information. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted

Page 51: Big data with apache hadoop

Hadoop – Node & Tracker

Page 52: Big data with apache hadoop

Hadoop MapReduce - Scheme

Page 53: Big data with apache hadoop

Apache Hadoop

- Download the Hadoop releases: http://hadoop.apache.org/releases.html

- Docs: http://hadoop.apache.org/docs/current/

- Single Node Setup: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html

- Cluster Setup: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

Page 54: Big data with apache hadoop

Hadoop

- Standalone → one machine → one JVM process

- Pseudo-distributed → one machine → several JVM processes

- Fully distributed → several machines → JVM processes on each machine

- SSH → the master's public & private key pair → the public key is distributed to all nodes

Page 55: Big data with apache hadoop

MapReduce Architecture

- JobClient
  - Submits jobs

- JobTracker
  - Coordinates the jobs
  - Schedules the jobs

- TaskTracker
  - Breaks the job down into two kinds of tasks: MAP and REDUCE

1. The JobClient submits a job to the JobTracker
2. The JobTracker sends a query to the NameNode, which knows the DataNode locations of the input data
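A minimal job-submission sketch (not from the slides) using the org.apache.hadoop.mapreduce client API. For brevity it runs Hadoop's built-in identity Mapper and Reducer; a real job would plug in its own subclasses, and the input/output paths are only illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "demo job");       // client-side handle for the job
            job.setJarByClass(SubmitJob.class);

            job.setMapperClass(Mapper.class);                  // identity map (replace with your own)
            job.setReducerClass(Reducer.class);                // identity reduce (replace with your own)
            job.setOutputKeyClass(LongWritable.class);         // matches TextInputFormat's (offset, line) pairs
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // illustrative paths
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            // Submits the job to the cluster (the JobTracker in classic MapReduce,
            // the ResourceManager under YARN) and waits for completion.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }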

Page 56: Big data with apache hadoop

MAPREDUCE - INTERNALS

[Diagram: on each DataNode the flow is INPUT FORMAT → SPLIT → MAP → SHUFFLE → SORT → REDUCE → OUTPUT, shown for DataNode (1) and DataNode (2)]

Page 57: Big data with apache hadoop

MAPREDUCE - INTERNALS

- SPLIT: input data is divided into splits based on the input format. Each input split equates to a map task running in parallel

- MAP: mappers transform the input splits into key/value pairs based on user-defined code

- SHUFFLE & SORT: moves the mappers' output to the reducers, ordered by key values

- REDUCE: reducers aggregate key values based on user-defined code

- OUTPUT FORMATS: define how results will be written to HDFS

Page 58: Big data with apache hadoop

EXAMPLE of MAPREDUCE

- Count the occurrences of words in a text:

  NUGGETS CISCO MS CISCO NUGGETS WEB LINUX WINDOWS IT NUGGETS LINUX CERTS HADOOP AWS IT

- INPUT SPLIT: the text is divided into splits:

  NUGGETS CISCO MS | CISCO NUGGETS WEB | LINUX WINDOWS IT | NUGGETS LINUX CERTS | HADOOP AWS IT

  → MAP …

Page 59: Big data with apache hadoop

EXAMPLE OF MAPREDUCE

- SPLIT:
  NUGGETS CISCO MS | CISCO NUGGETS WEB | LINUX WINDOWS IT | NUGGETS LINUX CERTS | HADOOP AWS IT

- MAP (each mapper emits a (word, 1) pair for every word in its split):
  NUGGETS(1) CISCO(1) MS(1) | CISCO(1) NUGGETS(1) WEB(1) | LINUX(1) WINDOWS(1) IT(1) | NUGGETS(1) LINUX(1) CERTS(1) | HADOOP(1) AWS(1) IT(1)

- SHUFFLE/SORT (pairs with the same key are grouped together):
  NUGGETS(1) NUGGETS(1) NUGGETS(1) | CISCO(1) CISCO(1) | MS(1) | LINUX(1) LINUX(1) | WINDOWS(1) | WEB(1) | IT(1) IT(1) | CERTS(1) | AWS(1) | HADOOP(1) | …

Page 60: Big data with apache hadoop

EXAMPLE OF MAPREDUCE

- SHUFFLE/SORT:
  NUGGETS(1) NUGGETS(1) NUGGETS(1) | CISCO(1) CISCO(1) | MS(1) | LINUX(1) LINUX(1) | WINDOWS(1) | WEB(1) | IT(1) IT(1) | CERTS(1) | AWS(1) | HADOOP(1) | …

- REDUCE:
  CISCO(2), MS(1), WEB(1), LINUX(2), WINDOWS(1), IT(2), HADOOP(1), AWS(1), NUGGETS(3), CERTS(1)

- The output is the pair (Key, Value)
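The same word count, written as a minimal Hadoop MapReduce sketch in Java. It mirrors the classic example shipped with Hadoop; class and variable names are illustrative:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // MAP: for every word in the input split, emit the pair (word, 1)
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);               // e.g. (NUGGETS, 1)
                }
            }
        }

        // REDUCE: after the shuffle/sort has grouped the pairs by key,
        // sum the ones for each word and emit the final (word, count) pair
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : values) {
                    sum += count.get();
                }
                result.set(sum);
                context.write(key, result);                 // e.g. (NUGGETS, 3)
            }
        }
    }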

Page 61: Big data with apache hadoop

Hadoop – Single Node

Page 62: Big data with apache hadoop

Hadoop – Single Node

Page 63: Big data with apache hadoop

Hadoop – Pseudo-Distributed Operation

Page 64: Big data with apache hadoop

Hadoop – Pseudo-Distributed Operations

Page 65: Big data with apache hadoop

Hadoop – Pseudo-Distributed Operations

Page 66: Big data with apache hadoop

Big Data Analytics – Use Cases

- Big data analytics for smart mobility

- Estimate of visitors' mobility flows

- Correlation between mobile phone data and spatial patterns

- Financial services

- Education

Page 67: Big data with apache hadoop

Big Data Analytics – Use Cases

- Health

- Agriculture

- Market competitiveness

- Behavior prediction

- Text analytics

- Telecommunication industry

Page 68: Big data with apache hadoop
Page 69: Big data with apache hadoop

Cloudera provides a scalable, flexible, integrated platform that makes it easy to manage rapidly increasing volumes and varieties of data. Cloudera products and solutions enable you to deploy and manage Apache Hadoop and related projects, manipulate and analyze your data, and keep that data secure and protected.