

Big Data Essentials

Copyright © 2016 by Anil K. Maheshwari, Ph.D.

By purchasing this book, you agree not to copy or distribute the book by any means, mechanical or electronic.

No part of this book may be copied or transmitted without written permission.

Other books by the same author:

Data Analytics Made Accessible, the #1 Bestseller in Data Mining

Moksha: Liberation Through Transcendence


Preface

Big Data is a new, and inclusive, natural phenomenon. It is as messy as nature itself. It requires a new kind of Consciousness to fathom its scale and scope, and its many opportunities and challenges. Understanding the essentials of Big Data requires suspending many conventional expectations and assumptions about data… such as completeness, clarity, consistency, and conciseness. Fathoming and taming the multi-layered Big Data is a dream that is slowly becoming a reality. It is a rapidly evolving field that is growing exponentially in value and capabilities.

There is a growing number of books being written on Big Data. They fall mostly into two categories. The first kind focuses on business aspects, and discusses the strategic internal shifts required for reaping the business benefits from the many opportunities offered by Big Data. The second kind focuses on particular technology platforms, such as Hadoop or Spark. This book aims to bring together the business context and the technologies in a seamless way.

This book was written to meet the needs of an introductory Big Data course. It is meant for students, as well as executives, who wish to take advantage of emerging opportunities in Big Data. It provides an intuition of the wholeness of the field in simple language, free from jargon and code. All the essential Big Data technology tools and platforms, such as Hadoop, MapReduce, Spark, and NoSQL, are discussed. Most of the relevant programming details have been moved to the Appendices to ensure readability. The short chapters make it easy to quickly understand the key concepts. A complete case study of developing a Big Data application is included.

Thanks to Maharishi Mahesh Yogi for creating a wonderful university whose consciousness-based environment made writing this evolutionary book possible. Thanks to many current and former students for contributing to this book. Dheeraj Pandey assisted with the Web log analyzer application and its details. Suraj Thapalia assisted with the Hadoop installation guide. Enkhbileg Tseeleesuren helped write the Spark tutorial. Thanks to my family for supporting me in this process. My daughters Ankita and Nupur reviewed the book and made helpful comments. My father Mr. R. L. Maheshwari and brother Dr. Sunil Maheshwari also read the book and enthusiastically approved it. My colleague Dr. Edi Shivaji also reviewed the book.

May the Big Data Force be with you!

Dr. Anil Maheshwari


August 2016, Fairfield, IA


Contents

Preface

Chapter 1 – Wholeness of Big Data
Introduction
Understanding Big Data
CASELET: IBM Watson: A Big Data system
Capturing Big Data
Volume of Data
Velocity of Data
Variety of Data
Veracity of Data
Benefitting from Big Data
Management of Big Data
Organizing Big Data
Analyzing Big Data
Technology Challenges for Big Data
Storing Huge Volumes
Ingesting streams at an extremely fast pace
Handling a variety of forms and functions of data
Processing data at huge speeds
Conclusion and Summary
Organization of the rest of the book
Review Questions
Liberty Stores Case Exercise: Step B1

Section 1

Chapter 2 - Big Data Applications
Introduction
CASELET: Big Data Gets the Flu
Big Data Sources
People to People Communications


Social Media
People to Machine Communications
Web access
Machine to Machine (M2M) Communications
RFID tags
Sensors
Big Data Applications
Monitoring and Tracking Applications
Analysis and Insight Applications
New Product Development
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B2

Chapter 3 - Big Data Architecture
Introduction
CASELET: Google Query Architecture
Standard Big Data architecture
Big Data Architecture examples
IBM Watson
Netflix
eBay
VMware
The Weather Company
Ticketmaster
LinkedIn
PayPal
CERN
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B3

Section 2


Chapter 4: Distributed Computing using Hadoop
Introduction
Hadoop Framework
HDFS Design Goals
Master-Slave Architecture
Block system
Ensuring Data Integrity
Installing HDFS
Reading and Writing Local Files into HDFS
Reading and Writing Data Streams into HDFS
Sequence Files
YARN
Conclusion
Review Questions

Chapter 5 – Parallel Processing with MapReduce
Introduction
MapReduce Overview
MapReduce programming
MapReduce Data Types and Formats
Writing MapReduce Programs
Testing MapReduce Programs
MapReduce Jobs Execution
How MapReduce Works
Managing Failures
Shuffle and Sort
Progress and Status Updates
Hadoop Streaming
Conclusion
Review Questions

Chapter 6 – NoSQL databases
Introduction


RDBMS vs NoSQL
Types of NoSQL Databases
Architecture of NoSQL
CAP theorem
Popular NoSQL Databases
HBase
Architecture Overview
Reading and Writing Data
Cassandra
Architecture Overview
Reading and Writing Data
Hive Language
Hive Language Capabilities
Pig Language
Conclusion
Review Questions

Chapter 7 – Stream Processing with Spark
Introduction
Spark Architecture
Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)
Spark Ecosystem
Spark for big data processing
MLlib
Spark GraphX
SparkR
Spark SQL
Spark Streaming
Spark applications
Spark vs Hadoop
Conclusion


Review Questions

Chapter 8 – Ingesting Data
Wholeness
Messaging Systems
Point to Point Messaging System
Publish-Subscribe Messaging System
Apache Kafka
Use Cases
Kafka Architecture
Producers
Consumers
Broker
Topic
Summary of Key Attributes
Distribution
Guarantees
Client Libraries
Apache ZooKeeper
Kafka Producer example in Java
Conclusion
Review Questions
References

Chapter 9 – Cloud Computing Primer
Introduction
Cloud Computing Characteristics
In-house storage
Cloud storage
Cloud Computing: Evolution of Virtualized Architecture
Cloud Service Models
Cloud Computing Myths
Cloud Computing: Getting Started


Conclusion
Review Questions

Section 3

Chapter 10 – Web Log Analyzer application case study
Introduction
Client-Server Architecture
Web Log analyzer
Requirements
Solution Architecture
Benefits of this solution
Technology stack
Apache Spark
Spark Deployment
Components of Spark
HDFS
MongoDB
Apache Flume
Overall Application logic
Technical Plan for the Application
Scala Spark code for log analysis
Sample Log data
Sample Input Data
Sample Output of Web Log Analysis
Conclusion and Findings
Review Questions

Chapter 11: Data Mining Primer
Gathering and selecting data
Data cleansing and preparation
Outputs of Data Mining
Evaluating Data Mining Results
Data Mining Techniques


Mining Big Data
From Causation to Correlation
From Sampling to the Whole
From Dataset to Data stream
Data Mining Best Practices
Conclusion
Review Questions

Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
Creating Cluster server on AWS, Install Hadoop from Cloudera
Step 1: Creating Amazon EC2 Servers
Step 2: Connecting server and installing required Cloudera distribution of Hadoop
Step 3: Word Count using MapReduce

Appendix 2: Spark Installation and Tutorial
Step 1: Verifying Java Installation
Step 2: Verifying Scala installation
Step 3: Downloading Scala
Step 4: Installing Scala
Step 5: Downloading Spark
Step 6: Installing Spark
Step 7: Verifying the Spark Installation
Step 8: Application: Word Count in Scala

Additional Resources

About the Author


Chapter 1 – Wholeness of Big Data

Introduction

Big Data is an all-inclusive term that refers to extremely large, very fast, diverse, and complex data that cannot be managed with traditional data management tools. Ideally, Big Data would harness all kinds of data and deliver the right information, to the right person, in the right quantity, at the right time, to help make the right decision. Big Data can be managed by developing infinitely scalable, totally flexible, and evolutionary data architectures, coupled with the use of extremely cost-effective computing components. The infinite potential knowledge embedded within this cosmic computer would help connect everything to the Unified Field of all the laws of nature.

This book will provide a complete overview of Big Data for the executive and the data specialist. This chapter will cover the key challenges and benefits of Big Data, and the essential tools and technologies now available for organizing and manipulating Big Data.


Understanding Big Data

Big Data can be examined on two levels. On a fundamental level, it is data that can be analyzed and utilized for the benefit of the business. On another level, it is a special kind of data that poses unique challenges. This is the level that this book will focus on.

Figure 1.1: Big Data Context

At the level of business, data generated by business operations can be analyzed to generate insights that can help the business make better decisions. This makes the business grow bigger and generate even more data, and the cycle continues. This is represented by the blue cycle on the top-right of Figure 1.1. This aspect is discussed in Chapter 11, a primer on Data Analytics.

On another level, Big Data is different from traditional data in every way: space, time, and function. The quantity of Big Data is 1,000 times more than that of traditional data. The speed of data generation and transmission is 1,000 times faster. The forms and functions of Big Data are much more diverse: from numbers to text, pictures, audio, videos, activity logs, machine data, and more. There are also many more sources of data, from individuals to organizations to governments, using a range of devices from mobile phones to computers to industrial machines. Not all data will be of equal quality and value. This is represented by the red cycle on the bottom left of Figure 1.1. This aspect of Big Data, and its new technologies, is the main focus of this book.

Big Data is mostly unstructured data. Every type of data is structured differently, and will have to be dealt with differently. There are huge opportunities for technology providers to innovate and manage the entire life cycle of Big Data… to generate, gather, store, organize, analyze, and visualize this data.


CASELET: IBM Watson: A Big Data system

IBM created the Watson system as a way of pushing the boundaries of Artificial Intelligence and natural language understanding technologies. Watson beat the world champion human players of Jeopardy (a quiz-style TV show) in February 2011. Watson reads up on data about everything on the web, including the entire Wikipedia. It digests and absorbs the data based on simple generic rules such as: books have authors; stories have heroes; and drugs treat ailments. A Jeopardy clue, received in the form of a cryptic phrase, is broken down into many possible potential sub-clues of the correct answer. Each sub-clue is examined to see the likelihood of its answer being the correct answer for the main problem. Watson calculates the confidence level of each possible answer. If the confidence level reaches more than a threshold level, it decides to offer the answer to the clue. It manages to do all this in a mere 3 seconds.

Watson is now being applied to diagnosing diseases, especially cancer. Watson can read all the new research published in medical journals to update its knowledge base. It is being used to estimate the probability of various diseases, by applying factors such as the patient's current symptoms, health history, genetic history, medication records, and other factors to recommend a particular diagnosis. (Source: Smartest machines on Earth: youtube.com/watch?v=TCOhyaw5bwg)

Figure 1.2: IBM Watson playing Jeopardy

Q1: What kinds of Big Data knowledge, technologies, and skills are required to build a system like Watson? What kind of resources are needed?

Q2: Will doctors be able to compete with Watson in diagnosing diseases and prescribing medications? Who else could benefit from a system like Watson?


Capturing Big Data

If data were simply growing too large, OR only moving too fast, OR only becoming too diverse, it would be relatively easy. However, when the four Vs (Volume, Velocity, Variety, and Veracity) arrive together in an interactive manner, it creates a perfect storm. While the Volume and Velocity of data drive the major technological concerns and the costs of managing Big Data, these two Vs are themselves being driven by the third V, the Variety of forms and functions and sources of data.

Volume of Data

The quantity of data has been relentlessly doubling every 12-18 months. Traditional data is measured in Gigabytes (GB) and Terabytes (TB), but Big Data is measured in Petabytes (PB) and Exabytes (1 Exabyte = 1 million TB).

This data is so huge that it is almost a miracle that one can find any specific thing in it in a reasonable period of time. Searching the world-wide web was the first true Big Data application. Google perfected the art of this application, and developed many of the path-breaking technologies we see today to manage Big Data.

The primary reason for the growth of data is the dramatic reduction in the cost of storing data. The costs of storing data have decreased by 30-40% every year. Therefore, there is an incentive to record everything that can be observed. This is called the 'datafication' of the world. The costs of computation and communication have also been coming down similarly. Another reason for the growth of data is the increase in the number of forms and functions of data. More about this in the Variety section.

Velocity of Data

If traditional data is like a lake, Big Data is like a fast-flowing river. Big Data is being generated by billions of devices, and communicated at the speed of the internet. Ingesting all this data is like drinking from a fire hose. One does not have control over how fast the data will come. A huge, unpredictable data stream is the new metaphor for thinking about Big Data.

The primary reason for the increased velocity of data is the increase in internet speed. Internet speeds available to homes and offices are now increasing from 10 MB/sec to 1 GB/sec (100 times faster). More people are getting access to high-speed internet around the world. Another important reason is the increased variety of sources that can generate and communicate data from anywhere, at any time. More on that in the Variety section.

Variety of Data


Big Data is inclusive of all forms of data, for all kinds of functions, from all sources and devices. If traditional data, such as invoices and ledgers, were like a small store, Big Data is the biggest imaginable shopping mall that offers unlimited variety. There are three major kinds of variety.

1. The first aspect of variety is the form of data. Data types range in order of simplicity and size from numbers to text, graph, map, audio, video, and others. There could be composite data that includes many elements in a single file. For example, text documents have text, graphs, and pictures embedded in them. Videos can have charts and songs embedded in them. Audio and video have different and more complex storage formats than numbers and text. Numbers and text can be more easily analyzed than an audio or video file. How should composite entities be stored and analyzed?

2. The second aspect is the variety of function of data. There are human chats and conversation data, songs and movies for entertainment, business transaction records, machine operations performance data, new product design data, old data for backup, and so on. Human communication data would be processed very differently from operational performance data, with totally different objectives. A variety of applications are needed to compare pictures in order to recognize people's faces, compare voices to identify the speaker, and compare handwriting to identify the writer.

3. The third aspect of variety is the source of data. Mobile phones and tablet devices enable a wide range of applications, or apps, to access and generate data anytime, anywhere. Web access logs are another new and huge source of diagnostic data. ERP systems generate massive amounts of structured business transactional information. Sensors on machines, and RFID tags on assets, generate incessant and repetitive data. Broadly speaking, there are three types of sources of data: human-human communications, human-machine communications, and machine-to-machine communications. The sources of data, and the applications arising from that data, will be discussed in the next chapter.


Figure 1.3: Sources of Big Data (Source: Hortonworks.com)

Veracity of Data

Veracity relates to the believability and quality of data. Big Data is messy. There is a lot of misinformation and disinformation. The reasons for poor quality of data can range from human and technical error to malicious intent.

1. The source of information may not be authoritative. For example, all websites are not equally trustworthy. Any information from whitehouse.gov or from nytimes.com is more likely to be authentic and complete. Wikipedia is useful, but not all pages are equally reliable. The communicator may have an agenda or a point of view.

2. The data may not be received correctly because of human or technical failure. Sensors and machines for gathering and communicating data may malfunction and may record and transmit incorrect data. Urgency may require the transmission of the best data available at a point in time. Such data makes reconciliation with later, accurate records more problematic.

3. The data provided and received may, however, also be intentionally wrong, for competitive or security reasons.

Data needs to be sifted and organized by quality factors for it to be put to any great use.


Benefitting from Big Data

Data usually belongs to the organization that generates it. There is other data, such as social media data, that is freely accessible under an open general license. Organizations can use this data to learn about their consumers, improve their service delivery, and design new products to delight their customers and gain a competitive advantage. Data is also like a new natural resource. It is being used to design new digital products, such as on-demand entertainment and learning.

Organizations may choose to gather and store this data for later analysis, or to sell it to other organizations who might benefit from it. They may also legitimately choose to discard parts of their data for privacy or legal reasons. However, organizations cannot afford to ignore Big Data. Organizations that do not learn to engage with Big Data could find themselves left far behind their competition, landing in the dustbin of history. Innovative small and new organizations can use Big Data to quickly scale up and beat larger and more mature organizations.

Big Data applications exist in all industries and aspects of life. There are three major types of Big Data applications: Monitoring and Tracking, Analysis and Insight, and new digital product development.

Monitoring and Tracking Applications: Consumer goods producers use monitoring and tracking applications to understand the sentiments and needs of their customers. Industrial organizations use Big Data to track inventory in massive interlinked global supply chains. Factory owners use it to monitor machine performance and do preventive maintenance. Utility companies use it to predict energy consumption, and manage demand and supply. Information Technology companies use it to track website performance and improve its usefulness. Financial organizations use it to project trends better and make more effective and profitable bets.

Analysis and Insight: Political organizations use Big Data to micro-target voters and win elections. Police forces use Big Data to predict and prevent crime. Hospitals use it to better diagnose diseases and to guide medicine prescriptions. Ad agencies use it to design more targeted marketing campaigns quickly. Fashion designers use it to track trends and create more innovative products.


Figure 1.4: The first Big Data President

New Product Development: Incoming data could be used to design new products such as reality TV entertainment. Stock market feeds could be a digital product. This area needs much more development.


Management of Big Data

Many organizations have started initiatives around the use of Big Data. However, most organizations do not necessarily have a grip on it. Here are some emerging insights into making better use of Big Data.

1. Across all industries, the business case for Big Data is strongly focused on addressing customer-centric objectives. The first focus of deploying Big Data initiatives is to protect and enhance customer relationships and the customer experience.

2. Solve a real pain point. Big Data should be deployed for specific business objectives, so that management is not overwhelmed by the sheer size of it all.

3. Organizations are beginning their pilot implementations by using existing and newly accessible internal sources of data. It is better to begin with data under one's control, and where one has a superior understanding of the data.

4. Put humans and data together to get the most insight. Combining data-based analysis with human intuition and perspectives is better than going just one way.

5. Advanced analytical capabilities are required, but lacking, for organizations to get the most value from Big Data. There is a growing awareness of the need to build or hire those skills and capabilities.

6. Use more diverse data, not just more data. This would provide a broader perspective into reality and better quality insights.

7. The faster you analyze the data, the greater its predictive value. The value of data depreciates with time. If the data is not processed in five minutes, then the immediate advantage is lost.

8. Don't throw away data if no immediate use can be seen for it. Data has value beyond what you initially anticipate. Data can add perspective to other data later on, in a multiplicative manner.

9. Maintain one copy of your data, not multiple copies. This would help avoid confusion and increase efficiency.

10. Plan for exponential growth. Data is expected to continue to grow at exponential rates. Storage costs continue to fall, data generation continues to grow, and data-based applications continue to grow in capability and functionality.

11. A scalable and extensible information management foundation is a prerequisite for Big Data advancement. Big Data builds upon a resilient, secure, efficient, flexible, and real-time information processing environment.


12. Big Data is transforming business, just like IT did. Big Data is a new phase representing a digital world. Business and society are not immune to its strong impacts.


Organizing Big Data

Good organization of the data depends upon the purpose for which it will be used.

Given the huge quantities involved, it would be desirable to organize the data to speed up the search for any specific, desired item in the entire dataset. The cost of storing and processing the data, too, would be a major driver for the choice of an organizing pattern.

Given the fast speed of data, it would be desirable to create a scalable number of ingest points. It would also be desirable to create at least a thin veneer of control over the data by maintaining counts and averages over time, unique values received, and so on.

Given the variety in form factors, different data needs to be stored and analyzed differently. Videos need to be stored separately and served in a streaming mode. Text data may be combined, cleaned, and visualized for themes and sentiments.

Given the different quality levels of data, various data sources may need to be ranked and prioritized before serving them to the audience. For example, the quality of a web page may be computed through a PageRank mechanism.
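
For a flavor of how such a ranking can be computed, the classic PageRank formula scores a page by the ranks of the pages that link to it. With damping factor d (commonly 0.85), N total pages, M(p_i) the set of pages that link to page p_i, and L(p_j) the number of outbound links on page p_j, the rank of p_i is

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Iterating this formula until the scores stabilize yields a quality ranking over all pages; production search engines apply many further refinements on top of this basic idea.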


Analyzing Big Data

Big Data can be analyzed in two ways, called analyzing Big Data in motion and analyzing Big Data at rest. The first way is to process the incoming stream of data in real time for quick and effective statistics about the data. The other way is to store and structure the data and apply standard analytical techniques on batches of data to generate insights. These insights could then be visualized using real-time dashboards. Big Data can thus be utilized to visualize a flowing or a static situation. The nature of processing this huge, diverse, and largely unstructured data is limited only by one's imagination.

Figure 1.5: Big Data Architecture

A million points of data can be plotted on a graph to offer a view of the density of the data. However, plotting a million points on a graph may produce a blurred image which hides, rather than highlights, the distinctions. In such a case, binning the data would help, or selecting the top few frequent categories may deliver greater insight. Streaming data can also be visualized by simple counts and averages over time. For example, below is a dynamically updated chart that shows up-to-date statistics of visitor traffic to my blog site, anilmah.com. The bars show the number of page views, and the inner darker bars show the number of unique visitors. The dashboard could also show the view by days, weeks, or years.


Figure 1.6: Real-time dashboard of website performance for the author's blog

Text data could be combined, filtered, cleaned, thematically analyzed, and visualized in a word cloud. Here is a word cloud from a recent stream of tweets (i.e., Twitter messages) from US Presidential candidates Hillary Clinton and Donald Trump. A larger word implies a greater frequency of occurrence in the tweets. This can help in understanding the major topics of discussion between the two.

Figure 1.7: A word cloud of Hillary Clinton's and Donald Trump's tweets
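
The counting behind such a word cloud is a simple word-frequency tally. Below is a minimal Scala sketch of that step; the tweet strings and the stop-word list are illustrative placeholders, not the actual campaign data.

// Tally word frequencies over a collection of tweet texts (illustrative sketch)
object WordFrequency {
  def main(args: Array[String]): Unit = {
    val tweets = Seq(
      "Jobs and the economy are the top priority",
      "We will create jobs and grow the economy")           // placeholder tweets
    val stopWords = Set("the", "and", "are", "we", "will")  // illustrative stop-word list

    val counts = tweets
      .flatMap(_.toLowerCase.split("\\W+"))                 // split into words
      .filter(w => w.nonEmpty && !stopWords.contains(w))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

    // The most frequent words would be drawn largest in the word cloud
    counts.toSeq.sortBy(-_._2).foreach { case (w, n) => println(s"$w: $n") }
  }
}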


Technology Challenges for Big Data

There are four major technological challenges in managing Big Data, and matching layers of technology to address them.

Storing Huge Volumes

The first challenge relates to storing huge quantities of data. No machine can be big enough to store the relentlessly growing quantity of data. Therefore, data needs to be stored on a large number of smaller, inexpensive machines. However, with a large number of machines, there is the inevitable challenge of machine failure. Each of these commodity machines will fail at some point or another. The failure of a machine could entail a loss of the data stored on it.

The first layer of Big Data technology helps store huge volumes of data while avoiding the risk of data loss. It distributes data across a large cluster of inexpensive commodity machines, and ensures that every piece of data is stored on multiple machines, to guarantee that at least one copy is always available. Hadoop is the most well-known clustering technology for Big Data. Its data storage pattern is called the Hadoop Distributed File System (HDFS). This system is built on the pattern of Google's file system, which was designed to store billions of pages and sort them to answer user search queries.
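
To make this concrete, here is a minimal Scala sketch that writes a small file into HDFS and reads it back through the Hadoop FileSystem API; the NameNode address and the file path are placeholder assumptions, and the full installation and programming details appear in Chapter 4 and Appendix 1.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Write a small file into HDFS and read it back (sketch only)
object HdfsRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020")   // hypothetical NameNode address
    val fs = FileSystem.get(conf)

    val path = new Path("/tmp/hello.txt")               // hypothetical file path
    val out = fs.create(path, true)                     // HDFS replicates the file's blocks
    out.writeBytes("Hello, HDFS\n")
    out.close()

    val in = fs.open(path)
    println(scala.io.Source.fromInputStream(in).mkString)
    in.close()
  }
}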

Ingesting streams at an extremely fast pace

The second challenge relates to the Velocity of data, i.e., handling torrential streams of data. Some of these streams may be too large to store, but they must still be ingested and monitored. The solution lies in creating special ingesting systems that can open an unlimited number of channels for receiving data. These queuing systems can hold the data, from which consumer applications can request and process data at their own pace.

Big Data technology manages this velocity problem using a special stream-processing engine, where all incoming data is fed into a central queuing system. From there, a fork-shaped system sends data in both a batch-processing and a stream-processing direction. The stream-processing engine can do its work while the batch processing does its own work. Apache Spark is the most popular system for streaming applications.
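
As a small illustration of the streaming side of this fork, the Scala sketch below counts words arriving on a socket in ten-second batches using Spark Streaming; the socket source (localhost:9999), the batch interval, and the local master setting are illustrative assumptions, not the book's case-study configuration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Count words per 10-second batch from a text stream (sketch only)
object StreamCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical data source
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()   // running statistics on the stream, while batch jobs run elsewhere
    ssc.start()
    ssc.awaitTermination()
  }
}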

Handling a variety of forms and functions of data

The third challenge relates to the structuring and access of all the varieties of data that comprise Big Data. Storing them in traditional flat or relational file structures would be too wasteful and slow. The third layer of Big Data technology solves this problem by storing the data in non-relational systems that relax many of the stringent conditions of the relational model. These are called NoSQL (Not Only SQL) databases.

HBase and Cassandra are two of the better-known NoSQL database systems. HBase, for example, stores each data element separately along with its key identifying information; this is called a key-value pair format. Cassandra stores data in a column-family (wide-column) format. There are many other variants of NoSQL databases. Languages such as Pig and Hive are used to access this data.
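
To make the key-value pair idea concrete, here is a minimal Scala sketch that writes and reads one cell through the HBase client API; the table name, column family, and row key are hypothetical examples.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Store and fetch one key-value pair in HBase (sketch only)
object HBaseKeyValue {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("web_pages"))      // hypothetical table

    val put = new Put(Bytes.toBytes("page#1001"))                  // row key
    put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("title"), Bytes.toBytes("Big Data"))
    table.put(put)                                                 // write the cell

    val result = table.get(new Get(Bytes.toBytes("page#1001")))    // read it back by key
    println(Bytes.toString(result.getValue(Bytes.toBytes("content"), Bytes.toBytes("title"))))

    table.close()
    conn.close()
  }
}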

Processing data at huge speeds

The fourth challenge relates to moving large amounts of data from storage to the processor, as this would consume enormous network capacity and choke the network. The alternative, innovative mode would be to move the processor to the data.

The second layer of Big Data technology avoids this choking of the network. It distributes the task logic throughout the cluster of machines where the data is stored. Those machines work in parallel on the data assigned to them. A follow-up process consolidates the outputs of all the small tasks and delivers the final results. MapReduce, also invented by Google, is the best-known technology for parallel processing of distributed Big Data.
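
The map-shuffle-reduce pattern can be pictured in miniature with ordinary Scala collections. The sketch below walks through the classic word-count example conceptually; a real job would be written against the Hadoop MapReduce API and run across a cluster (see Chapter 5 and Appendix 1), and the two input documents here are placeholders.

// Conceptual walk-through of MapReduce word count using plain Scala collections
object MapReduceIdea {
  def main(args: Array[String]): Unit = {
    val documents = Seq("big data is big", "data moves fast")   // each entry stands for one input split

    // Map phase: each mapper emits (word, 1) pairs for its split, in parallel on many machines
    val mapped: Seq[(String, Int)] =
      documents.flatMap(doc => doc.split("\\s+").map(word => (word, 1)))

    // Shuffle and sort: group the pairs by key so each reducer sees all values for one word
    val grouped: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2) }

    // Reduce phase: sum the counts for each word
    val counts: Map[String, Int] = grouped.map { case (word, ones) => word -> ones.sum }

    counts.toSeq.sortBy(_._1).foreach(println)
  }
}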

Table 1.1: Technological challenges and solutions for Big Data

Challenge | Description | Solution | Technology
Volume | Avoid the risk of data loss from machine failure in clusters of commodity machines | Replicate segments of data on multiple machines; a master node keeps track of segment locations | HDFS
Volume & Velocity | Avoid choking of network bandwidth by moving large volumes of data | Move processing logic to where the data is stored; manage using parallel processing algorithms | MapReduce
Variety | Efficient storage of large and small data objects | Columnar databases using a key-value pair format | HBase, Cassandra
Velocity | Monitoring streams too large to store | Fork-shaped architecture to process data as stream and as batch | Spark


Once these major technological challenges are met, all traditional analytical and presentation tools can be applied to Big Data. There are many additional supportive technologies that make the task of managing Big Data easier. For example, a resource manager (such as YARN) can help monitor the resource usage and load balancing of the machines in the cluster.


Conclusion and Summary

Big Data is a major phenomenon that impacts everyone, and it is an opportunity to create new ways of working. Big Data is extremely large, complex, fast, and not always clean; it is data that comes from many sources, such as people, the web, and machine communications. It needs to be gathered, organized, and processed in a cost-effective way that manages the volume, velocity, variety, and veracity of Big Data. Hadoop and Spark systems are popular technological platforms for this purpose. Here is a list of the many differences between traditional data and Big Data.

Table 1.2: Comparing Big Data with Traditional Data

Feature | Traditional Data | Big Data
Representative Structure | Lake / Pool | Flowing Stream / River
Primary Purpose | Manage business activities | Communicate, Monitor
Source of data | Business transactions, documents | Social media, Web access logs, machine-generated
Volume of data | Gigabytes, Terabytes | Petabytes, Exabytes
Velocity of data | Ingest level is controlled | Real-time unpredictable ingest
Variety of data | Alphanumeric | Audio, Video, Graphs, Text
Veracity of data | Clean, more trustworthy | Varies depending on source
Structure of data | Well-structured | Semi-structured or unstructured
Physical Storage of Data | In a Storage Area Network | Distributed clusters of commodity computers
Database organization | Relational databases | NoSQL databases
Data Access | SQL | NoSQL languages such as Pig
Data Manipulation | Conventional data processing | Parallel processing
Data Visualization | Variety of tools | Dynamic dashboards with simple measures
Database Tools | Commercial systems | Open source: Hadoop, Spark
Total Cost of System | Medium to High | High


Organization of the rest of the book

This book will cover applications, architectures, and the essential Big Data technologies. The rest of the book is organized as follows.

Section 1 will discuss sources, applications, and architectural topics. Chapter 2 will discuss a few compelling business applications of Big Data, based on an understanding of the different sources and formats of data. Chapter 3 will cover some examples of architectures used by many Big Data applications.

Section 2 will discuss the six major technology elements identified in the Big Data Ecosystem (Figure 1.5). Chapter 4 will discuss Hadoop and how its Distributed File System (HDFS) works. Chapter 5 will discuss MapReduce and how this parallel processing algorithm works. Chapter 6 will discuss NoSQL databases, to learn how to structure the data into databases for fast access; the Pig and Hive languages, for data access, will be included. Chapter 7 will cover streaming data and the systems for ingesting and processing this data; this chapter will cover Spark, an integrated, in-memory processing toolset to manage Big Data. Chapter 8 will cover data ingest systems, with Apache Kafka. Chapter 9 will be a primer on Cloud Computing technologies used for renting storage and computing at third-party locations.

Section 3 will include primers and tutorials. Chapter 10 will present a case study on the web log analyzer, an application that ingests a log of a large number of web request entries every day and can create summary and exception reports. Chapter 11 will be a primer on data analytics technologies for analyzing data; a full treatment can be found in my book, Data Analytics Made Accessible. Appendix 1 will be a tutorial on installing a Hadoop cluster on the Amazon EC2 cloud. Appendix 2 will be a tutorial on installing and using Spark.


Review Questions

Q1. What is Big Data? Why should anyone care?

Q2. Describe the 4V model of Big Data.

Q3. What are the major technological challenges in managing Big Data?

Q4: What are the technologies available to manage Big Data?

Q5. What kinds of analyses can be done on Big Data?

Q6: Watch the Cloudera CEO present the evolution of Hadoop at https://www.youtube.com/watch?v=S9xnYBVqLws. Why did people not pay attention to Hadoop and MapReduce when they were first introduced? What implications does this have for emerging technologies?


Liberty Stores Case Exercise: Step B1

Liberty Stores Inc. is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of Health and Sustainability) citizens worldwide. The company is 20 years old and is growing rapidly. It now operates in 5 continents, 50 countries, and 150 cities, and has 500 stores. It sells 20,000 products and has 10,000 employees. The company has revenues of over $5 billion and a profit of about 5% of its revenue. The company pays special attention to the conditions under which its products are grown and produced. It donates about one-fifth (20%) of its pre-tax profits to global and local charitable causes.

Q1: Create a comprehensive Big Data strategy for the CEO of the company.

Q2: How can Big Data systems, such as IBM Watson, help this company?


Section 1

This section covers three important high-level topics.

Chapter 2 will cover big data sources, and many applications in many industries.

Chapter 3 will cover architectures for managing big data.


Chapter 2 - Big Data Applications

Introduction

If a traditional software application is a lovely cat, then a Big Data application is a powerful tiger. An ideal Big Data application will take advantage of all the richness of data and produce relevant information to make the organization responsive and successful. Big Data applications can align the organization with the totality of natural laws, the source of all success.

Companies like the consumer goods giant Procter & Gamble have inserted Big Data into all aspects of their planning and operations. The industrial giant Volkswagen asks all its business units to identify some realistic initiative using Big Data to grow their unit's sales. The entertainment giant Netflix processes 400 billion user actions every day; these companies are among the biggest users of Big Data.

Figure 2.1: A Big Data application is a powerful tiger (Source: Flickr.com)


CASELET: Big Data Gets the Flu

Google Flu Trends was an enormously successful influenza forecasting service, pioneered by Google. It employed Big Data, such as the stream of search terms used in its ubiquitous Internet search service. The program aimed to better predict flu outbreaks using data and information from the U.S. Centers for Disease Control and Prevention (CDC). What was most amazing was that this application was able to predict the onset of flu almost two weeks before the CDC saw it coming. From 2004 till about 2012, it was able to successfully predict the timing and geographical location of the arrival of the flu season around the world.

Figure 2.2: Google Flu Trends

However, it failed spectacularly to predict the 2013 flu outbreak. Data used to predict Ebola's spread in 2014-15 yielded wildly inaccurate results and created a major panic. Newspapers across the globe spread this application's worst-case scenarios for the Ebola outbreak of 2014.

Google Flu Trends failed for two reasons: Big Data hubris, and algorithmic dynamics. (a) The quantity of data does not mean that one can ignore foundational issues of measurement, construct validity, reliability, and dependencies among the data; and (b) the Google Flu Trends predictions were based on a commercial search algorithm that frequently changes based on Google's business goals. This uncertainty skewed the data in ways even Google engineers did not understand, even skewing the accuracy of predictions. Perhaps the biggest lesson is that there is far less information in the data typically available in the early stages of an outbreak than is needed to parameterize the test models.

Q1: What lessons would you learn from the death of a prominent and highly successful Big Data application?

Q2: What other Big Data applications could be inspired by the success of this application?


Big Data Sources

Big Data is inclusive of all data about all activities everywhere. It can, thus, potentially transform our perspective on life and the universe. It brings new insights in real time and can make life happier and the world more productive. Big Data can, however, also bring perils, in terms of violations of privacy, and social and economic disruption.

There are three major categories of data sources: human communications, human-machine communications, and machine-machine communications.

Page 45: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

People to People Communications

People and corporations increasingly communicate over electronic networks. Distance and time have been annihilated. Everyone communicates through phone and email. News travels instantly. Influential networks have expanded. The content of communication has become richer and multimedia. High-resolution cameras in mobile phones enable people to take pictures and videos, and instantly share them with friends and family. All these communications are stored in the facilities of many intermediaries, such as telecom and internet service providers. Social media is a new, but particularly transformative, type of human-human communication.

Social Media

Social media platforms such as Facebook, Twitter, LinkedIn, YouTube, Flickr, Tumblr, Skype, Snapchat, and others have become an increasingly intimate part of modern life. These are among the hundreds of social media platforms that people use, and they generate huge streams of text, pictures, videos, logs, and other multimedia data.

People share messages and pictures through social media such as Facebook and YouTube. They share photo albums through Flickr. They communicate in short asynchronous messages with each other on Twitter. They make friends on Facebook, and follow others on Twitter. They do video conferencing using Skype, and leaders deliver messages that sometimes go viral through social media. All these data streams are part of Big Data, and can be monitored and analyzed to understand many phenomena, such as patterns of communication, as well as the gist of the conversations. These media have been used for a wide variety of purposes, with stunning effects.

Figure 2.3: A sampling of major social media


People to Machine Communications

Sensors and the web are two of the kinds of machines that people communicate with. Personal assistants such as Siri and Cortana are the latest in man-machine communications, as they try to understand human requests in natural language and fulfil them. Wearable devices such as Fitbit and smart watches are smart devices that read, store, and analyze people's personal data, such as blood pressure and weight, food and exercise data, and sleep patterns. The world-wide web is like a knowledge machine that people interact with to get answers to their queries.

Web access

The world-wide web has integrated itself into all parts of human and machine activity. The usage of tens of billions of pages by billions of web users generates huge amounts of enormously valuable clickstream data. Every time a web page is requested, a log entry is generated at the provider's end. The web page provider tracks the identity of the requesting device and user, and the time and spatial location of each request. On the requester's side, there are small pieces of computer code and data called cookies, which track the web pages received, the date/time of access, and some identifying information about the user. All the web access logs and cookie records can provide web usage records that can be analyzed to discover opportunities for marketing purposes.

A web log analyzer is an application that monitors streaming web access logs in real time to check on website health and to flag errors. A detailed case study of the practical development of this application is presented in Chapter 10.
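
To give a feel for the raw material such an analyzer works with, the Scala sketch below parses a single access-log line; the Common Log Format and the sample line are illustrative assumptions, since actual log formats vary by web server.

// Parse one web-server access-log line (assumes Common Log Format; sample line is made up)
object LogLineParser {
  private val LogPattern =
    """^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)""".r

  def main(args: Array[String]): Unit = {
    val line =
      """192.168.1.20 - alice [05/Aug/2016:10:15:32 -0500] "GET /index.html HTTP/1.1" 200 5123"""
    line match {
      case LogPattern(ip, user, time, method, url, status, bytes) =>
        println(s"ip=$ip user=$user time=$time method=$method url=$url status=$status bytes=$bytes")
      case _ =>
        println("line did not match the expected format")
    }
  }
}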


Machine to Machine (M2M) Communications

M2M communication is also sometimes called the Internet of Things (IoT). A trillion devices are connected to the internet, and they communicate with each other or with some master machines. All this data can be accessed and harnessed by the makers and owners of those machines.

Machines and equipment have many kinds of sensors to measure certain environmental parameters, which can be broadcast to communicate their status. RFID tags and sensors embedded in machines help generate the data. Containers on ships are tagged with RFID tags that convey their location to all those who can listen. Similarly, when pallets of goods are moved in warehouses or large retail stores, those pallets carry electromagnetic (RFID) tags that convey their location. Cars carry an RFID transponder to identify themselves to automated toll booths and pay the tolls. Robots in a factory, and internet-connected refrigerators in a house, continually broadcast a 'heartbeat' to show that they are functioning normally. Surveillance videos, using commodity cameras, are another major source of machine-generated data.

Automobiles contain sensors that record and communicate operational data. A modern car can generate many megabytes of data every day, and there are more than 1 billion motor vehicles on the road. Thus the automotive industry itself generates huge amounts of data. Self-driving cars would only add to the quantity of data generated.

RFID tags

An RFID tag is a radio transmitter with a little antenna that can respond to and communicate essential information to special readers through a Radio Frequency (RF) channel. A few years ago, major retailers such as Walmart decided to invest in RFID technology to take the retail industry to a new level. They forced their suppliers to invest in RFID tags on the supplied products. Today, almost all retailers and manufacturers have implemented RFID-tag-based solutions.

Figure 2.4: A small passive RFID tag

Here is how an RFID tag works. When a passive RFID tag comes in the vicinity of an RF reader and is 'tickled', the tag responds by broadcasting a fixed identifying code. An active RFID tag has its own battery and storage, and can store and communicate a lot more information. Every reading of a message from an RFID tag by an RF reader creates a log entry. Thus there is a steady stream of data from every reader as it records information about all the RFID tags in its area of influence. The readings may be logged regularly, and thus there will be many more records than are necessary to track the location and movement of an item. All the duplicate and redundant records are removed to produce clean, consolidated data about the location and status of items.
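
As a small illustration of that consolidation step, the Scala sketch below keeps only the latest reading per tag; the tag IDs, reader IDs, and timestamps are made-up sample values.

// Consolidate repeated RFID reader log entries: keep the latest reading per tag (sketch only)
object RfidConsolidate {
  case class Reading(tagId: String, readerId: String, timestamp: Long)

  def main(args: Array[String]): Unit = {
    val readings = Seq(                       // made-up reader log entries
      Reading("TAG-42", "DOCK-1", 1000L),
      Reading("TAG-42", "DOCK-1", 1005L),     // redundant: same tag, same reader, moments later
      Reading("TAG-42", "DOCK-2", 1100L),     // the tagged item has moved
      Reading("TAG-77", "DOCK-1", 1002L))

    val latestPerTag = readings
      .groupBy(_.tagId)
      .map { case (tag, rs) => tag -> rs.maxBy(_.timestamp) }

    latestPerTag.values.foreach(println)      // one consolidated record per tag
  }
}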

Sensors

A sensor is a small device that can observe and record physical or chemical parameters. Sensors are everywhere. A photosensor in an elevator or train door can sense if someone is moving, and thus keep the door from closing. A CCTV camera can record a video for surveillance purposes. A GPS device can record its geographical location every moment.

Figure 2.5: An embedded sensor

Temperature sensors in a car can measure the temperature of the engine, the tires, and more. The thermostat in a building or a refrigerator also has temperature sensors. A pressure sensor can measure the pressure inside an industrial boiler.


Big Data Applications

Monitoring and Tracking Applications

Public Health Monitoring

The US government is encouraging all healthcare stakeholders to establish a national platform for interoperability and data sharing standards. This would enable secondary use of health data, which would advance Big Data analytics and personalized holistic precision medicine. This would be a broad-based platform, like the Google Flu Trends case.

Consumer Sentiment Monitoring

Social media has become more powerful than advertising. Many consumer goods companies have moved a bulk of their marketing budgets from traditional advertising media into social media. They have set up Big Data listening platforms, where social media data streams (including tweets, Facebook posts, and blog posts) are filtered and analyzed for certain keywords or sentiments, by certain demographics and regions. Actionable information from this analysis is delivered to marketing professionals for appropriate action, especially when the product is new to the market.

Figure 2.6: Architecture for a Listening Platform (Source: Intelligenthq.com)

Asset tracking

The US Department of Defense is encouraging the industry to devise a tiny RFID chip that could prevent the counterfeiting of electronic parts that end up in avionics or circuit boards for other devices. Airplanes are one of the heaviest users of sensors, which track every aspect of the performance of every part of the plane. The data can be displayed on the dashboard, as well as stored for later detailed analysis. Working with communicating devices, these sensors can produce a torrent of data.

Theft by visitors, shoppers, and even employees is a major source of loss of revenue for retailers. All valuable items in the store can be assigned RFID tags, and the gates of the store are equipped with RF readers. This helps secure the products and reduce leakage (theft) from the store.

Supply chain monitoring

All containers on ships communicate their status and location using RFID tags. Thus, retailers and their suppliers can gain real-time visibility into the inventory throughout the global supply chain. Retailers can know exactly where items are in the warehouse, and so can bring them into the store at the right time. This is particularly relevant for seasonal items that need to be sold on time, or else they will be sold at a discount. With item-level RFID tags, retailers also gain full visibility of each item and can serve their customers better.

Electricity Consumption Tracking

Electric utilities can track the status of generating and transmission systems, and also measure and predict the consumption of electricity. Sophisticated sensors can help monitor voltage, current, frequency, temperature, and other vital operating characteristics of huge and expensive electric distribution infrastructure. Smart meters can measure the consumption of electricity at regular intervals of one hour or less. This data is analyzed to make real-time decisions to maximize power capacity utilization and total revenue generation.

Preventive Machine Maintenance

All machines, including cars and computers, will fail sometime, because one or more of their components will fail. Any precious equipment could be equipped with sensors. The continuous stream of data from these sensors could be monitored and analyzed to forecast the status of key components, and thus to monitor the overall machine's health. Preventive maintenance can be scheduled to reduce the cost of downtime.

Analysis and Insight Applications

Big Data can be structured and analyzed using data mining techniques to produce insights and patterns that can be used to make business better.

Predictive Policing


The Los Angeles Police Department (LAPD) invented the concept of Predictive Policing. The LAPD worked with UC Berkeley researchers to analyze its large database of 13 million crimes recorded over 80 years, and predicted the likelihood of crimes of certain types, at certain times, and in certain locations. They identified hotspots of crime where crimes had occurred, and where crime was likely to happen in the future. Crime patterns were mathematically modeled after a simple insight borrowed from the metaphor of earthquakes and their aftershocks. In essence, it said that once a crime occurred in a location, it represented a certain disturbance in harmony, and would thus lead to a greater likelihood of a similar crime occurring in the local vicinity in the near future. The model showed, for each police beat, the specific neighborhood blocks and specific time slots where crime was likely to occur.

Figure 2.7: An LAPD officer on predictive policing (Source: nbclosangeles.com)

By aligning the police cars' patrol schedules with the model's predictions, the LAPD was able to reduce crime by 12% to 26% for different categories of crime. Recently, the San Francisco Police Department released its own crime data for over 2 years, so that data analysts could model that data and help prevent future crimes.

Winning Political Elections

The US President, Barack Obama, was the first major political candidate to use Big Data in a significant way, in the 2008 elections. He is the first Big Data president. His campaign gathered data about millions of people, including supporters. They invented the "Donate Now" button for use in emails to obtain campaign contributions from millions of supporters. They created personal profiles of millions of supporters, covering what they had done and could do for the campaign. Data was used to determine undecided voters who could be converted to their side. They provided phone numbers of these undecided voters to the supporters to call, and then recorded the outcomes of those calls over the web, using interactive applications. Obama himself used his Twitter account to communicate his messages directly to his millions of followers.

After the elections, Obama converted the list of supporters into an advocacy machine that would provide grassroots support for the President's initiatives. Since then, almost all campaigns have used Big Data. Senator Bernie Sanders used the same Big Data playbook to build an effective national political machine powered entirely by small donors. The analyst Nate Silver created sophisticated predictive models, using inputs from many political polls and surveys, to successfully predict winners of US elections. Nate was, however, unsuccessful in predicting Donald Trump's rise, and that shows the limits of Big Data.

Personal Health

A correct diagnosis is the sine qua non of effective treatment. Medical knowledge and technology are growing by leaps and bounds. IBM Watson is a Big Data Analytics engine that ingests and metabolizes all the medical information in the world, and then applies it intelligently to an individual situation. Watson can provide a detailed and accurate medical diagnosis using current symptoms, patient history, medication history, environmental trends, and other parameters. Similar products might be offered as an app to licensed doctors, and even individuals, to improve productivity and accuracy in health care.

New Product Development

These applications are totally new concepts that did not exist earlier.

Flexible auto insurance

An auto insurance company can use the GPS data from cars to calculate the risk of accidents based on travel patterns. The automobile companies can use the car sensor data to track the performance of a car. Safer drivers can be rewarded and the errant drivers can be penalized.


Figure 2.8: GPS-based tracking of vehicles

Location-based retail promotion

A retailer, or a third-party advertiser, can target customers with specific promotions and coupons based on location data obtained through GPS, the time of day, the presence of stores nearby, and a mapping to the consumer preference data available from social media databases. Ads and offers can be delivered through mobile apps, SMS, and email. These are examples of mobile applications of Big Data.

Recommendation service

E-commerce has been a fast-growing industry over the last couple of decades. A variety of products are sold and shared over the internet. Web users' browsing and purchase histories on e-commerce sites are utilized to learn about their preferences and needs, and to advertise relevant product and pricing offers in real time. Amazon uses a personalized recommendation engine system to suggest new additional products to consumers based on affinities of various products. Netflix also uses a recommendation engine to suggest entertainment options to its users.
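
A simple way to picture such product affinities is to count how often two products appear together in purchase histories. The Scala sketch below does exactly that over a few made-up baskets; real recommendation engines are far more sophisticated, and the product names are placeholders.

// Count product co-occurrence across purchase baskets, then list the most frequent
// co-purchases for one product (illustrative sketch only)
object AffinityRecommender {
  def main(args: Array[String]): Unit = {
    val baskets = Seq(                        // made-up purchase histories
      Set("book", "lamp", "desk"),
      Set("book", "lamp"),
      Set("book", "pen"),
      Set("lamp", "desk"))

    // Emit each unordered pair of products per basket, then count the pairs
    val pairCounts = baskets
      .flatMap(b => b.toSeq.combinations(2).map(_.sorted).map(p => (p(0), p(1))))
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

    // Products most often bought together with "book"
    val forBook = pairCounts.collect {
      case ((a, b), n) if a == "book" => (b, n)
      case ((a, b), n) if b == "book" => (a, n)
    }.toSeq.sortBy(-_._2)

    forBook.foreach { case (product, n) => println(s"$product co-purchased $n times") }
  }
}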


Conclusion

Big Data has applicability across all industries. There are three major types of data sources of Big Data. They are people-to-people communications, people-to-machine communications, and machine-to-machine communications. Each type has many sources of data. There are three types of applications: the monitoring type, the analysis type, and new product development. This chapter presented a few business applications of each of those three types.


Review Questions

Q1: What are the major sources of Big Data? Describe a source of each type.

Q2: What are the three major types of Big Data applications? Describe two applications of each type.

Q3: Would it be ethical to arrest someone based on a Big Data model's prediction that the person is likely to commit a crime?

Q4: An auto insurance company learned about the movements of a person based on the GPS installed in the vehicle. Would it be ethical to use that as a surveillance tool?

Q5: Research and describe a Big Data application that has a proven return on investment for an organization.


Liberty Stores Case Exercise: Step B2

The Board of Directors asked the company to take concrete and effective steps to become a data-driven company. The company wants to understand its customers better. It wants to improve the happiness levels of its customers and employees. It wants to innovate on new products that its customers would like. It wants to relate its charitable activities to the interests of its customers.

Q1: What kind of data sources should the company capture for this?

Q2: What kind of Big Data applications would you suggest for this company?


Chapter 3 - Big Data Architecture

Introduction

Big Data Application Architecture is the configuration of tools and modules to accomplish the whole task. An ideal architecture would be resilient, secure, cost-effective, and adaptive to new needs and environments. This is achieved by beginning with proven architectures, and creatively and progressively restructuring them with new elements as additional needs and problems arise. Big Data architectures ultimately align with the architecture of the Universe, the source of all invincibility.


CASELET: Google Query Architecture

Google invented the first Big Data architecture. Their goal was to gather all the information on the web, organize it, and search it for specific queries from millions of users. An additional goal was to find a way to monetize this service by serving relevant and prioritized online advertisements on behalf of clients.

Google developed web-crawling agents which would follow all the links on the web and make a copy of all the content on all the web pages they visited.

Google invented cost-effective, resilient, and fast ways to store and process all that exponentially growing data. It developed a scale-out architecture in which it could linearly increase its storage capacity by inserting additional computers into its computing network. The data files were distributed over the large number of machines in the cluster. This distributed file system was called the Google File System, and was the precursor to HDFS.

Google would sort or index the data thus gathered so it could be searched efficiently. They invented the key-value NoSQL database architecture to store a variety of data objects. They developed the storage system to avoid updates in the same place. Thus the data was written once, and read multiple times.

Figure 3-0-1: Google Query Architecture

Google developed the MapReduce parallel processing architecture whereby large data sets could be processed by thousands of computers in parallel, with each computer processing a chunk of data, to produce quick results for the overall job.


The Hadoop ecosystem of data management tools, like the Hadoop Distributed File System (HDFS), the columnar database system HBase, a querying tool such as Hive, and more, emerged from Google's inventions. Storm is a streaming data technology that produces instant results. The Lambda Architecture is a Y-shaped architecture that branches out the incoming data stream for batch as well as stream processing.

Q1: Why would Google publish its File System and the MapReduce parallel programming system and release them into the open-source domain?

Q2: What else can be done with Google's repository of all the web's data?


Standard Big Data Architecture

Here is the generic Big Data Architecture introduced in Chapter 1. There are many sources of data. All data is funneled in through an ingest system. The data is forked into two sides: a stream processing system and a batch processing system. The outcomes of this processing can be sent into NoSQL databases for later retrieval, or sent directly for consumption by many applications and devices.

Figure 3-0-2: Big Data Application Architecture

A big data solution typically comprises these logical layers. Each layer can be represented by one or more available technologies.

Big data sources: The sources of data for an application depend upon what data is required to perform the kind of analyses you need. The various sources of Big Data were described in Chapter 2. The data will vary in origin, size, speed, form, and function, as described by the 4 Vs in Chapter 1. Data sources can be internal or external to the organization. The scope of access to the available data could be limited. The level of structure could be high or low. The speed of data and its quantity will also be high or low depending upon the data source.

Data ingest layer: This layer is responsible for acquiring data from the data sources. The data comes in through a scalable set of input points that can acquire data at various speeds and in various quantities. The data is sent to a batch processing system, a stream processing system, or directly to a storage file system (such as HDFS). Compliance regulations and governance policies impact what data can be stored and for how long.

Batch processing layer: This analysis layer receives data from the ingest point, or from the file system, or from the NoSQL databases. Data is processed using parallel programming techniques (such as MapReduce) to produce the desired results. This batch processing layer thus needs to understand the data sources and data types, the algorithms that would work on that data, and the format of the desired outcomes. The output of this layer could be sent for instant reporting, or stored in a NoSQL database for an on-demand report for the client.

Stream processing layer: This layer receives data directly from the ingest point. Data is processed in real time using parallel stream processing techniques (such as Spark or Storm) to produce the desired results. This layer thus needs to understand the data sources and data types extremely well, and the super-light algorithms that would work on that data to produce the desired results. The outcome of this layer too could be stored in the NoSQL databases.

Data organizing layer: This layer receives data from both the batch and stream processing layers. Its objective is to organize the data for easy access. It is represented by NoSQL databases. SQL-like languages like Hive and Pig can be used to easily access data and generate reports.

Data consumption layer: This layer consumes the output provided by the analysis layers, directly or through the organizing layer. The output could be standard reports, data analytics, dashboards and other visualization applications, or recommendation engines, delivered on mobile and other devices.

Infrastructure layer: At the bottom there is a layer that manages the raw resources of storage, compute, and communication. This is increasingly provided through a cloud computing paradigm.

Distributed file system layer: This layer includes the Hadoop Distributed File System (HDFS). It also includes supporting applications, such as YARN (Yet Another Resource Negotiator), that enable efficient access to data storage and its transfer.


Big Data Architecture Examples

Every major organization and application has a unique, optimized infrastructure to suit its specific needs. Below are some architecture examples from some very prominent users and designers of Big Data applications.

IBM Watson

IBM Watson uses Spark to manage incoming data streams. It also uses Spark's Machine Learning library (MLlib) to analyze data and predict diseases.

Netflix

This is one of the largest providers of online video entertainment. They handle 400 billion online events per day. As a cutting-edge user of big data technologies, they are constantly innovating their mix of technologies to deliver the best performance. Kafka is the common messaging system for all incoming requests. They host the entire infrastructure on Amazon Web Services (AWS). The data stores are AWS S3 as well as Cassandra and HBase. Spark is used for stream processing.


(Source: Netflix)

eBay

eBay is the second-largest ecommerce company in the world. It delivers 800 million listings from 25 million sellers to 160 million buyers. To manage this huge stream of activity, eBay uses a stack of Hadoop, Spark, Kafka, and other elements. They think that Kafka is the best new thing for processing data streams.

VMware

Here is VMware's view of a Big Data architecture. It is similar to, but more detailed than, our main big data architecture diagram.


The Weather Company

The Weather Company serves weather data globally through websites and mobile apps. It uses a streaming architecture based on Apache Spark.

Ticketmaster

This is the world's largest company that sells event tickets. Their goal is to make tickets available to purchase for real fans, and prevent bad actors from manipulating the system to increase the price of the tickets in the secondary markets.


LinkedIn

The goal of this professional networking company is to maintain an efficient system for processing the streaming data and make the link options available in real time.

PayPal

This payments-facilitation company needs to understand and acquire customers, and process a large number of payment transactions.


CERN

This premier high-energy physics research lab computes petabytes of data using in-memory stream processing to process data from millions of sensors and devices.


Conclusion

Big Data applications are architected to do stream as well as batch processing. Data is ingested and fed into streaming and batch processing. Most tools used for big data processing are open-source tools served through the Apache community and some key distributors of those technologies.


Review Questions

Q1: Describe the Big Data processing architecture.

Q2: What are Google's contributions to Big Data processing?

Q3: What are some of the hottest technologies visible in Big Data processing?


Liberty Stores Case Exercise: Step B3

The company wants to build a scalable and futuristic platform for its Big Data.

Q1: What kind of Big Data processing architecture would you suggest for this company?


Section 2

This section covers the important Big Data technologies defined in the Big Data architecture specified in Chapter 3.

Chapter 4 will cover Hadoop and its Distributed File System (HDFS).

Chapter 5 will cover the parallel processing algorithm, MapReduce.

Chapter 6 will cover NoSQL databases such as HBase and Cassandra. It will also cover the Pig and Hive languages used for accessing those databases.

Chapter 7 will cover Spark, a fast and integrated streaming data management platform.

Chapter 8 will cover data ingest systems, using Apache Kafka.

Chapter 9 will cover the Cloud Computing model.


Chapter 4: Distributed Computing using Hadoop

Introduction

A distributed system is a clever way of storing huge quantities of data, securely and cost-effectively, for speed and ease of retrieval and processing, using a networked collection of commodity machines. The ideal distributed file system would store infinite amounts of data while making the complexity completely transparent to the user, and enable easy access to the right data instantly. This would be achieved by storing fragments of data at different locations, and internally managing the lower-level tasks of storing and replicating data across the network. The distributed system ultimately leads to the creation of the unbounded cosmic computer that is aligned with the Unified Field of all the laws of nature.


Hadoop Framework

The Apache Hadoop distributed computing framework is composed of the following modules:

1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

2. Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;

3. YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications; and

4. MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

This chapter will cover Hadoop Common, HDFS, and YARN. The next chapter will cover MapReduce.


HDFS Design Goals

The Hadoop distributed file system (HDFS) is a distributed and scalable file system. It is designed for applications that deal with large data sizes. It is also designed to deal with mostly immutable files, i.e. write data once, but read it many times.

HDFS has the following major design goals:

1. Hardware failure management – it will happen, and one must plan for it.
2. Huge volume – create capacity for a large number of huge file sizes, with fast read/write throughput.
3. High speed – create a mechanism to provide low-latency access to streaming applications.
4. High variety – maintain simple data coherence, by writing data once but reading it many times.
5. Open source – maintain easy accessibility of data using any hardware, software, and database platform.
6. Network efficiency – minimize network bandwidth requirements, by minimizing data movement.


Master-Slave Architecture

Hadoop is an architecture for organizing computers in a master-slave relationship that helps achieve great scalability in processing. An HDFS cluster has two types of nodes operating in a master-worker pattern: a single master node (called the NameNode), and a large number of slave worker nodes (called DataNodes). A small Hadoop cluster includes a single master and multiple worker nodes. A large Hadoop cluster would consist of a master and thousands of small, ordinary machines as worker nodes.

Figure 4-0-1: Master-Slave Architecture

The master node manages the overall file system, its namespace, and controls the access to files by clients. The master node is aware of the DataNodes: i.e. what blocks of which file are stored on which DataNode. It also controls the processing plan for all applications running on the data on the cluster. There is only one master node. Unfortunately, that makes it a single point of failure. Therefore, whenever possible, the master node has a hot backup just in case the master node dies unexpectedly. The master node uses a transaction log to persistently record every change that occurs to file system metadata.

The worker nodes store the data blocks in their storage space, as directed by the master node. Each worker node typically contains many disks to maximize storage capacity and access speed. Each worker node has its own local file system. A worker node has no awareness of the distributed file structure. It simply stores each block of data as directed, as if each block were a separate file. The DataNodes store and serve up blocks of data over the network using a block protocol, under the direction of the NameNode.


Figure 4-0-2: Hadoop Architecture (Source: Hadoop.apache.org)

The NameNode stores all relevant information about all the DataNodes, and the files stored in those DataNodes. The NameNode will contain:

- For every DataNode: its name, rack, capacity, and health.

- For every file: its name, replicas, type, size, timestamp, location, health, etc.

If a DataNode fails, there is no serious problem. The data on the failed DataNode will be accessed from its replicas on other DataNodes. The failed DataNode can be automatically recreated on another machine, by writing all those file blocks off from the other healthy replicas. Each DataNode sends a heartbeat message to the NameNode periodically. Without this message, the DataNode is assumed to be dead. The DataNode replication effort would automatically kick in to replace the dead DataNode.

The file system has a set of features and capabilities to completely hide the splintering and scattering of data, and enable the user to deal with the data at a high, logical level.

The NameNode tries to ensure that files are evenly spread across the DataNodes in the cluster. That balances the storage and computing load, and also limits the extent of loss from the failure of a node. The NameNode also tries to optimize the networking load. When retrieving data or ordering the processing, the NameNode tries to pick fragments from multiple nodes to balance the processing load and speed up the total processing effort. The NameNode also tries to store fragments of a file on the same node for speed of reading and writing. Processing is done on the node where the file fragment is stored.

Any piece of data is typically stored on three nodes: two on the same rack, and one on a different rack. DataNodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.


Block System

HDFS stores large files (typically gigabytes to terabytes) by storing segments (called blocks) of the file across multiple machines. A block of data is the fundamental storage unit in HDFS. Data files are described, read, and written in block-sized granularity. All storage capacity and file sizes are measured in blocks. A block ranges from 16-128 MB in size, with a default block size of 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.

Every data file takes up a number of blocks depending upon its size. Thus a 100 MB file will occupy two blocks (100 MB divided by 64 MB, rounded up), with some room to spare. Every storage disk can accommodate a number of blocks depending upon the size of the disk. Thus a 1 Terabyte store will have about 16,000 blocks (more precisely, 1 TB divided by 64 MB is 16,384).
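As a minimal illustration of this arithmetic (assuming the default 64 MB block size, and the same file and disk sizes used above), the block counts can be computed with a ceiling division:

public class BlockMath {
  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;                       // default HDFS block size: 64 MB
    long fileSize = 100L * 1024 * 1024;                       // a 100 MB data file
    long fileBlocks = (fileSize + blockSize - 1) / blockSize; // ceiling division
    System.out.println(fileBlocks);                           // prints 2

    long diskSize = 1024L * 1024 * 1024 * 1024;               // a 1 TB storage disk
    System.out.println(diskSize / blockSize);                 // prints 16384
  }
}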

Every file is organized as a consecutively numbered sequence of blocks. A file's blocks are stored physically close to each other for ease of access, as far as possible. The file's block size and replication factor are configurable by the application that writes the file on HDFS.


Ensuring Data Integrity

Hadoop ensures that no data will be lost or corrupted during storage or processing. The files are written only once, and never updated in place. They can be read many times. Only one client can write or append to a file at a time. No concurrent updates are allowed.

If data is indeed lost or corrupted, or if a part of the disk gets corrupted, a new healthy replica for that lost block will be automatically recreated by copying from the replicas on other DataNodes. At least one of the replicas is stored on a DataNode on a different rack. This guards against the failure of the rack of nodes, or of the networking router on it.

A checksum algorithm is applied to all data written to HDFS. A process of serialization is used to turn files into a byte stream for transmission over a network or for writing to persistent storage. Hadoop has additional security built in, using a Kerberos verifier.


Installing HDFS

It is possible to run Hadoop on an in-house cluster of machines, or on the cloud, inexpensively. As an example, The New York Times used 100 Amazon Elastic Compute Cloud (EC2) instances (DataNodes) and a Hadoop application to process 4 TB of raw image TIFF data stored in Amazon Simple Storage Service (S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240 (not including bandwidth). See Chapter 9 for a primer on Cloud Computing. See Appendix 1 for a step-by-step tutorial on installing Hadoop on Amazon EC2.

Hadoop is written in Java, and requires a working Java installation. Installing Hadoop takes a lot of resources. For example, all information about fragments of files needs to be in the NameNode's memory. A thumb rule is that Hadoop needs approximately 1 GB of memory to manage 1 million file fragments. Many easy mechanisms exist to install the entire Hadoop stack. Using a GUI such as the Cloudera Resources Manager to install a Cloudera Hadoop stack is easy. This stack includes HDFS and many other related components, such as HBase, Pig, YARN, and more. Installing it on a cluster on a cloud services provider like AWS is easier than installing Java Virtual Machines (JVMs) on one's own hardware. HDFS can be installed by using the Cloudera GUI Resources Manager. If installing from the command line, download Hadoop from one of the Apache mirror sites.

Most access to files is provided through the Java abstract class org.apache.hadoop.fs.FileSystem. HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems. File access can be achieved through the native Java application programming interface (API). Another API, called Thrift, helps to generate a client in the language of the users' choosing (such as C++, Java, Python). When the Hadoop command is invoked with a class name as the first argument, it launches a Java virtual machine (JVM) to run the class, along with the relevant Hadoop libraries (and their dependencies) on the classpath.

HDFS has a UNIX-like command-line interface (CLI). Use the ssh shell to communicate with Hadoop. HDFS has a UNIX-like permissions model for files and directories. There are three progressively increasing levels of permissions: read (r), write (w), and execute (x). Create an hd user, and communicate using the ssh shell on the local machine.

% hadoop fs -help     ## get detailed help on every command

Reading and Writing Local Files into HDFS

There are two different ways to transfer data: from the local file system, or from an input/output stream. Copying a file from the local file system to HDFS can be done by:

% hadoop fs -copyFromLocal path/filename

Reading and Writing Data Streams into HDFS

Reading a file from HDFS by using a java.net.URL object to open a stream requires a short script, as below:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // details of processing the stream go here
} finally {
    IOUtils.closeStream(in);
}
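For this idiom to work, the Java runtime must first be told how to handle URLs with the hdfs:// scheme. A minimal self-contained sketch, modeled on the common URLCat pattern (the hdfs://host/path argument here is illustrative), would look like this:

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class HdfsUrlCat {
  static {
    // Register the hdfs:// scheme with java.net.URL; this can only be done once per JVM
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();             // e.g. hdfs://host/path/to/file
      IOUtils.copyBytes(in, System.out, 4096, false); // copy the stream to standard output
    } finally {
      IOUtils.closeStream(in);
    }
  }
}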

A simple method to create a new file is as follows:

public FSDataOutputStream create(Path p) throws IOException

Data can be appended to an existing file using the append() method:

public FSDataOutputStream append(Path p) throws IOException

A directory can be created by a simple method:

public boolean mkdirs(Path p) throws IOException

List the contents of a directory using:

public FileStatus[] listStatus(Path p) throws IOException

public FileStatus[] listStatus(Path p, PathFilter filter) throws IOException
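A minimal sketch that strings these FileSystem methods together is shown below (it assumes a configured Hadoop client on the classpath; the directory and file names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml settings
    FileSystem fs = FileSystem.get(conf);       // handle to the configured (distributed) file system

    Path dir = new Path("/user/hd/demo");       // illustrative directory
    fs.mkdirs(dir);                             // create the directory

    FSDataOutputStream out = fs.create(new Path(dir, "greeting.txt")); // create a new file
    out.writeUTF("Hello HDFS");                 // write some data
    out.close();

    for (FileStatus status : fs.listStatus(dir)) {   // list the directory contents
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}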


Sequence Files

The incoming data files can range from very small to extremely large, and with different structures. Big Data files are therefore organized quite differently to handle the diversity of file sizes and types. Large files are stored as HDFS files, with file fragments distributed across the cluster. However, smaller files should be bunched together into a single segment for efficient storage.

SequenceFiles are a specialized data structure within Hadoop to handle smaller files with smaller record sizes. A SequenceFile uses a persistent data structure for data available in key-value pair format. These help efficiently store smaller objects. HDFS and MapReduce are designed to work with large files, so packing small files into a SequenceFile container makes storing and processing the smaller files more efficient for HDFS and MapReduce.

Sequence files are row-oriented file formats, which means that the values for each row are stored contiguously in the file. This format is appropriate when a large number of columns of a single row are needed for processing at the same time. There are easy commands to create, read, and write SequenceFile structures. Sorting and merging SequenceFiles is native to the MapReduce system. A MapFile is essentially a sorted SequenceFile with an index to permit lookups by key.
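As a hedged sketch of packing many small records into one SequenceFile and reading them back (the path and the key/value classes are illustrative choices, and this uses one of the older, simpler createWriter signatures; newer Hadoop versions prefer an options-based variant):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/hd/messages.seq");   // illustrative path

    // Pack many small records into one container file as <IntWritable, Text> pairs
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
    for (int i = 0; i < 100; i++) {
      writer.append(new IntWritable(i), new Text("message number " + i));
    }
    writer.close();

    // Read the records back in the order they were written
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();
    Text value = new Text();
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}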


YARN

YARN (Yet Another Resource Negotiator) is the architectural center of Hadoop. It is often characterized as a large-scale, distributed operating system for big data applications. YARN manages resources and monitors workloads, in a secure multi-tenant environment, while ensuring high availability across multiple Hadoop clusters. YARN also brings great flexibility as a common platform to run multiple tools and applications, such as interactive SQL (e.g. Hive), real-time streaming (e.g. Spark), and batch processing (MapReduce), on data stored in a single HDFS storage platform. It brings clusters more scalability to expand beyond 1000 nodes, and it also improves cluster utilization through dynamic allocation of cluster resources to various applications.

Figure 4-0-3: Hadoop Distributed Architecture including YARN

The Resource Manager in YARN has two main components: the Scheduler and the Applications Manager.

The YARN Scheduler allocates resources to the various requesting applications. It does so based on an abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk storage, network, etc. Each machine also has a NodeManager that manages all the Containers on that machine, and reports status on resources and Containers to the YARN Scheduler.

The YARN Applications Manager accepts new job submissions from the client. It then requests a first resource Container for the application-specific ApplicationMaster program, and monitors the health and execution of the application. Once running, the ApplicationMaster directly negotiates additional resource Containers from the Scheduler as needed.


Conclusion

Hadoop is the major technology for managing big data. HDFS securely stores data on large clusters of commodity machines. A master machine controls the storage and processing activities of the worker machines. A NameNode controls the namespace and storage information for the file system on the DataNodes. A master JobTracker controls the processing of tasks at the DataNodes. YARN is the resource manager that manages all resources dynamically and efficiently across all applications on the cluster. The Hadoop file system and other parts of the Hadoop stack are distributed by many vendors, and can be easily installed on cloud computing infrastructure. A Hadoop installation tutorial is in Appendix A.


Review Questions

Q1: How does Hadoop differ from a traditional file system?

Q2: What are the design goals for HDFS?

Q3: How does HDFS ensure the security and integrity of data?

Q4: How does a master node differ from a worker node?


Chapter 5 – Parallel Processing with MapReduce

Introduction

A parallel processing system is a clever way to process huge amounts of data in a short period of time by enlisting the services of many computing devices to work on parts of the job simultaneously. The ideal parallel processing system will work across any computational problem, using any number of computing devices, across any size of data sets, with ease and high programmer productivity. This is achieved by framing the problem in a way that it can be broken down into many parts, such that each part can be partially processed independently of the other parts; and then the intermediate results from processing the parts can be combined to produce a final solution. Infinite parallel processing is the essence of the infinite dynamism of the laws of nature.


MapReduce Overview

MapReduce is a parallel programming framework for speeding up large-scale data processing for certain types of tasks. It achieves this with minimal movement of data on distributed file systems such as HDFS clusters, to achieve near-real-time results. There are two major prerequisites for MapReduce programming: (a) the application must lend itself to parallel programming; (b) the data can be expressed in key-value pairs.

MapReduce processing is similar to the UNIX command sequence (also called pipe) structure, e.g., the UNIX-style pipeline:

grep | sort | count    myfile.txt

will produce a word count of the text document called myfile.txt. (The count step here is conceptual; in standard UNIX it would be done with uniq -c.)

There are three commands in this sequence, and they work as follows: (a) grep is a command to read the text file and create an intermediate file with one word on a line; (b) the sort command will sort that intermediate file, and produce an alphabetically sorted list of words in that set; (c) the count command will work on that sorted list, to produce the number of occurrences of each word, and display the results to the user in a "word, frequency" pair format.

For example, suppose myfile.txt contains the following text:

My file: We are going to a picnic near our house. Many of our friends are coming. You are welcome to join us. We will have fun.

The outputs of Grep, Sort, and WordCount will be as shown below.

Grep       Sort       WordCount
We         a          a         1
are        are        are       3
going      are        coming    1
to         are        friends   1
a          coming     fun       1
picnic     friends    going     1
near       fun        have      1
our        going      house     1
house      have       join      1
Many       house      many      1
of         join       near      1
our        many       of        1
friends    near       our       2
are        of         picnic    1
coming     our        to        2
You        our        us        1
are        picnic     we        2
welcome    to         welcome   1
to         to         will      1
join       us         you       1
us         We
we         we
will       welcome
have       will
fun        you

If the file is very large, then it will take the computer a long time to process it. Parallel processing can help here.

MapReduce speeds up the computation by reading and processing small chunks of the file by different computers in parallel. Thus, if a file can be broken down into 100 small chunks, each chunk can be processed at a separate computer in parallel. The total time taken to process the file could be 1/100 of the time taken otherwise. However, now the results of the computation on small chunks are residing in 100 different places. This large number of partial results needs to be combined to produce a composite result. The results of the outputs from various chunks will be combined by another program called the Reduce program.

The Map step will distribute the full job into smaller tasks that can be done on separate computers, each using only a part of the data set. The result of the Map step will be considered as intermediate results. The Reduce step will read the intermediate results, and will combine all of them and produce the final result. The programmer needs to specify the functional logic for both the map and reduce steps. The sorting, between the Map and Reduce steps, does not need to be specified and is automatically taken care of by the MapReduce system as a standard service provided to every job. The sorting of the data requires a field to sort on. Thus the intermediate results need to have some kind of a key field, and a set of associated non-key attribute(s) for that key.

Figure 5-0-1: MapReduce Architecture

In practice, to manage the variety of data structures stored in the file system, data is stored as one key and one non-key attribute. Thus the data is represented as a key-value pair. The intermediate results, and the final results, will also all be in key-value pair format. Thus a key requirement for the use of the MapReduce parallel processing system is that the input data and output data must both be represented in key-value format.

The Map step reads data in key-value pair format. The programmer decides what the characteristics of the key and value fields should be. The Map step produces results in key-value pair format. However, the characteristics of the keys produced by the Map step, i.e. the intermediate results, need not be the same keys as in the input data. So, those can be called key2-value2 pairs.

The Reduce step reads the key2-value2 pairs, the intermediate results produced by the Map step. The Reduce step will produce an output using the same keys that it read. Only the values associated with those keys will change, though, as a result of processing. Thus it can be labeled as key2-value3 format.

Suppose the text in myfile.txt can be split into 4 approximately equal segments. It could be done with each sentence as a separate piece of text. The four segments will look as follows:

Segment 1: We are going to a picnic near our house.

Segment 2: Many of our friends are coming.

Segment 3: You are welcome to join us.

Segment 4: We will have fun.

Thus the input to the 4 processors in the Map step will be in key-value pair format. The first column is the key, which is the entire sentence in this case. The second column is the value, which in this application is the frequency of the sentence.

We are going to a picnic near our house.    1

Many of our friends are coming.    1

You are welcome to join us.    1

We will have fun.    1

This task can be done in parallel by four processors. Each of these segments will be the task for a different processor. Thus each task will produce a file of words, each with a count of 1. There will be four intermediate files, in <key, value> pair format, shown below.

Key2     Value2    Key2      Value2    Key2      Value2    Key2    Value2
we       1         many      1         you       1         we      1
are      1         of        1         are       1         will    1
going    1         our       1         welcome   1         have    1
to       1         friends   1         to        1         fun     1
a        1         are       1         join      1
picnic   1         coming    1         us        1
near     1
our      1
house    1

The sort process inherent within MapReduce will sort each of the intermediate files, and produce the following sorted key-value pairs:

Key2     Value2    Key2      Value2    Key2      Value2    Key2    Value2
a        1         are       1         are       1         fun     1
are      1         coming    1         join      1         have    1
going    1         friends   1         to        1         we      1
house    1         many      1         us        1         will    1
near     1         of        1         welcome   1
our      1         our       1         you       1
picnic   1
to       1
we       1

The Reduce function will read the sorted intermediate files, and combine the counts for all the unique words, to produce the following output. The keys remain the same as in the intermediate results. However, the values change as counts from each of the intermediate files are added up for each key. For example, the count for the word 'are' goes up to 3.

Key2       Value3
a          1
are        3
coming     1
friends    1
fun        1
going      1
have       1
house      1
join       1
many       1
near       1
of         1
our        2
picnic     1
to         2
us         1
we         2
welcome    1
will       1
you        1

This output will be identical to that produced by the UNIX sequence earlier.
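For readers who want to trace these steps concretely, here is a small self-contained Java simulation of the map, sort/shuffle, and reduce phases on the same four sentence segments (an in-memory sketch to illustrate the flow, not Hadoop code; punctuation and case are ignored for simplicity):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
  public static void main(String[] args) {
    String[] segments = {
      "We are going to a picnic near our house",
      "Many of our friends are coming",
      "You are welcome to join us",
      "We will have fun"
    };

    // Map phase: each segment independently emits (word, 1) pairs
    List<String[]> intermediate = new ArrayList<>();
    for (String segment : segments) {
      for (String word : segment.toLowerCase().split(" ")) {
        intermediate.add(new String[] { word, "1" });
      }
    }

    // Shuffle and sort phase: group the pairs by key (a TreeMap keeps the keys sorted)
    Map<String, List<String>> grouped = new TreeMap<>();
    for (String[] pair : intermediate) {
      grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
    }

    // Reduce phase: sum the counts for each key and emit the final (word, count) pairs
    for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
      int count = 0;
      for (String v : entry.getValue()) {
        count += Integer.parseInt(v);
      }
      System.out.println(entry.getKey() + "\t" + count);
    }
  }
}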


MapReduce Programming

A data processing problem needs to be transformed into the MapReduce model. The first step is to visualize the processing plan as a map and a reduce step. When the processing gets more complex, this complexity is generally manifested in having more MapReduce jobs, or more complex map and reduce jobs. Having more but simpler MapReduce jobs leads to more easily maintainable mapper and reducer programs.

MapReduce Data Types and Formats

MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3). Since Mapper and Reducer are separate classes, the type parameters have different scopes.

Hadoop can process many different types of data formats, from flat text files to databases. An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical: they may map to a full file, a part of a file, or a collection of files. In a database context, a split might correspond to a range of rows from a table, and a record to a row in that range.

Writing MapReduce Programs

Start by writing pseudocode for the map and reduce functions. The program code for both the map and the reduce functions can then be written in Java or other languages. In Java, the map function is represented by the generic Mapper class. It uses four parameters: input key, input value, output key, and output value. This class uses an abstract map() method. This method receives the input key and input value. It would normally produce an output key and output value. For more complex problems, it is better to use a higher-level language than MapReduce, such as Pig, Hive, Cascading, Crunch, or Spark.

A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (selecting the records of interest). The reducer typically combines (adds or averages) those values.

Figure 5-0-2: MapReduce Program Flow

Here below is the step-by-step logic. Imagine that we want to do a word count of all unique words in a text.

1. The big document is split into many segments. The map step is run on each segment of data. The output will be a set of key, value pairs. In this case, the key will be a word in the document.

2. The system will gather the key, value pair outputs from all the mappers, and will sort them by key. The sorted list itself may then be split into a few segments.

3. A Reducer task will read the sorted list and produce a combined list of word counts.

Here is the pseudocode for word count:

map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
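A corresponding Java implementation is sketched below. It follows the standard Hadoop word-count example (the class names, combiner choice, and command-line arguments are the conventional illustrative ones, not something specific to this book's case study):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job; args[0] is the input path, args[1] the output path
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the combiner reduces map output locally to save bandwidth
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}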

Testing MapReduce Programs

Mapper programs running on a cluster can be complicated to debug. The time-honored way of debugging programs is via print statements. However, with the programs eventually running on tens or thousands of nodes, it is best to debug the programs in stages. Therefore, run the program using small sample data sets to ensure that the program is working correctly. Expand the unit tests to cover larger data sets and run it on a cluster. Ensure that the mapper or reducer can handle the inputs correctly. Running against the full data set is likely to expose some more issues, which should be fixed by altering your mapper or reducer to handle the new cases. After the program is working, it may be tuned to make the entire MapReduce job run faster.

It may be desirable to split the logic into many simple mappers and chain them into a single mapper using a facility (the ChainMapper library class) built into Hadoop. It can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.


MapReduce Job Execution

A MapReduce job is specified by the Map program and the Reduce program, along with the data sets associated with that job. There is another master program that resides and runs endlessly on the NameNode. It is called the JobTracker, and it tracks the progress of the MapReduce jobs from beginning to completion. Hadoop divides the job into two kinds of tasks: map tasks and reduce tasks. Hadoop moves the Map and Reduce computation logic to each DataNode that is hosting a part of the data. The communication between the nodes is accomplished using YARN, Hadoop's native resource manager.

The master machine (NameNode) is completely aware of the data stored on each of the worker machines (DataNodes). It schedules the map or reduce jobs to task trackers with full awareness of the data location. For example: if node A contains data (x, y, z) and node B contains data (a, b, c), the job tracker schedules node B to perform map or reduce tasks on (a, b, c), and node A would be scheduled to perform map or reduce tasks on (x, y, z). This reduces the data traffic and prevents choking of the network.

Each DataNode has a worker program called the TaskTracker. This program monitors the execution of every task assigned to it by the NameNode. When a task is completed, the TaskTracker sends a completion message to the JobTracker program on the master node.

The jobs and tasks work in a master-slave mode.

Figure 5-0-3: Hierarchical Monitoring Architecture

When there is more than one job in a MapReduce workflow, it is necessary that they be executed in the right order. For a linear chain of jobs it might be easy. For a more complex directed acyclic graph (DAG) of jobs, there are libraries that can help orchestrate your workflow. Or one can use Apache Oozie, a system for running workflows of dependent jobs.

Oozie consists of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster.

The data set for the MapReduce job is divided into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The tasks are scheduled using YARN and run on nodes in the cluster. YARN ensures that if a task fails or is inordinately delayed, it will be automatically scheduled to run on a different node. The outputs of the map jobs are fed as input to the reduce job. That logic is also propagated to the node(s) that will do the reduce jobs. To save on bandwidth, Hadoop allows the use of a combiner function on the map output. Then the combiner function's output forms the input to the reduce function.

How MapReduce Works

A MapReduce job can be executed with a single method call: submit() on a Job object. When the resource manager receives a call to its submitApplication() method, it hands off the request to the YARN scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process. The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress. It retrieves the input splits computed in the client from the shared file system. It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given IDs at this point. The application master must decide how to run the tasks that make up the MapReduce job. The application master requests containers for all the map and reduce tasks in the job from the resource manager. Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager. The task is executed by a Java application whose main class is YarnChild.

Managing Failures

There can be failures at the level of the entire job or of particular tasks. The entire application master itself could fail.

Task failure usually happens when the user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error to its parent application master, where it is logged into error logs. The application master will then reschedule execution of the task on another DataNode.


The entire job, i.e. the MapReduce application master running on YARN, too can fail. In that case, it is started again, subject to a maximum number of attempts, which is a user-set configuration parameter.

If a node manager fails by crashing or running very slowly, it will stop sending heartbeats to the resource manager (or send them very infrequently). The resource manager will then remove it from its pool of nodes to schedule containers on. Any task or application master running on the failed node manager will be recovered using error logs, and started on other nodes.

The YARN Resource Manager can also fail, and that has more severe consequences for the entire cluster. Therefore, typically, there will be a hot standby for YARN. If the active resource manager fails, then the standby can take over without a significant interruption to the client. The new resource manager can read the application information from the state store, and then restart the applications that were running on the cluster.

Shuffle and Sort

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.

When the map function starts producing output, it is not directly written to disk. The map task takes advantage of buffering writes in memory and doing some presorting for efficiency reasons. Each map task has a circular memory buffer that it writes the output to. Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key. If there is a combiner function, it is run on the output of the sort so that there is less data to transfer to the reducer.

The reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts reading their outputs as soon as each completes. When all the map outputs have been read, the reduce task merges the map outputs, maintaining their sort ordering. The reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output file system, such as HDFS.

Progress and Status Updates

MapReduce jobs are long-running batch jobs, taking a long time to run. It is important for the user to get feedback on the job's progress. A job and each of its tasks have a status value (e.g., running, successfully completed, failed), the progress of maps and reduces, and the values of the job's counters. These values are constantly communicated back to the client. When the application master receives a notification that the last task for a job is complete, it changes the status for the job to "successful." Job statistics and counters are communicated to the user.

Hadoop comes with a native web-based GUI for tracking MapReduce jobs. It displays useful information about a job's progress, such as how many tasks have been completed, and which ones are still being executed. Once the job is completed, one can view the job statistics and logs.


Hadoop Streaming

Hadoop Streaming uses standard Unix streams as the interface between Hadoop and the user's program. Streaming is an ideal approach for text processing. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format (a tab-separated key-value pair) passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.


Conclusion

MapReduce is the first popular parallel programming framework for Big Data. It works well for applications where the data can be large, divisible into separate sets, and represented in <key, value> pair format. The application logic is divided into two parts: a Map program and a Reduce program. Each of these programs can be run in parallel by several machines.


Review Questions

Q1: What is MapReduce? What are its benefits?

Q2: What is the key-value pair format? How is it different from other data structures? What are its benefits and limitations?


Chapter 6 – NoSQL Databases

A NoSQL database is a clever way to cost-effectively organize large amounts of heterogeneous data for efficient access and updates. The ideal NoSQL database is completely aligned with the nature of the problems being solved, and is super fast at that task. This is achieved by releasing and relaxing many of the integrity and redundancy constraints of storing data in relational databases, and storing data in many innovative formats aligned with business needs. The diverse NoSQL databases will ultimately collectively evolve into a holistic set of efficient and elegant data structures at the heart of a cosmic computer of infinite organization capacity.


Introduction

Relational database management systems (RDBMS) are a powerful database technology used universally by almost all enterprises. Relational databases are structured and optimized to ensure accuracy and consistency of data, while also eliminating any redundancy of data. These databases are stored on the largest and most reliable of computers to ensure that the data is always available at a granular level and at a high speed.

Big data is, however, a much larger and unpredictable stream of data. Relational databases are inadequate for this task, and would also be very expensive for such large data volumes. Managing the costs and speed of handling such large and heterogeneous data streams requires relaxing many of the strict rules and requirements of relational data. Depending upon which constraint(s) are relaxed, a different kind of database structure will emerge. These are called NoSQL databases, to differentiate them from relational databases that use Structured Query Language (SQL) as the primary means to manipulate data.

NoSQL databases are next-generation databases that are non-relational in their design. The name NoSQL is meant to differentiate them from antiquated, 'pre-relational' databases. Today, almost every organization that needs to gather customer feedback and sentiments to improve their business will use a NoSQL database. NoSQL is useful when an enterprise needs to access, analyze, and utilize massive amounts of either structured or unstructured data, or data that is stored remotely in any virtual server across the globe.

The constraints of a relational database are relaxed in many ways. For example, relational databases require that any data element could be randomly accessed and its value could be updated in that same physical location. However, the simple physics of storage says that it is simpler and faster to read or write sequential blocks of data on a disk. Therefore, NoSQL database files are written once and almost never updated in place. If a new version of a part of the data becomes available, it would be stored elsewhere by the system. The system would have the intelligence to link the updated data to the old data.

Pig and Hive are two key and popular languages in the Hadoop ecosystem that work well on NoSQL databases. Pig originated at Yahoo, while Hive originated at Facebook. Both Pig and Hive can use the same data as an input, and can achieve similar results with queries. Both Pig Latin and Hive commands eventually compile into Map and Reduce jobs. They have a similar goal - to ease the complexity of writing complex Java MapReduce programs. Most MapReduce jobs can be implemented easily in Hive or Pig.


For analytical needs, Hive is preferable over Pig. For controlled processing, Pig's scripting design is preferable. Hive leads to ease and productivity with its SQL-like design and user interface. Pig offers greater control over data flows. Java MapReduce can be used with more advanced APIs to accomplish things when there is something special needed, such as interacting with a third-party tool, or some special data characteristics.


RDBMS vs NoSQL

They are different in many ways. First, NoSQL databases do not support relational schemas or the SQL language. The term NoSQL stands mostly for "Not only SQL". Second, their transaction processing capabilities are fast but weak, and they do not support the ACID (Atomicity, Consistency, Isolation, Durability) properties associated with transaction processing using relational databases. Instead, they are approximately accurate at any point in time, and will be eventually consistent. Third, these databases are also distributed and horizontally scalable to manage web-scale databases using Hadoop clusters of storage. Thus they work well with the write-once, read-many storage mechanism of Hadoop clusters.

Feature       | RDBMS                                          | NoSQL
Applications  | Mostly centralized applications (e.g. ERP)     | Mostly designed for decentralized applications (e.g. Web, mobile, sensors)
Availability  | Moderate to high                               | Continuous availability to receive and serve data
Velocity      | Moderate velocity of data                      | High velocity of data (devices, sensors, social media, etc.); low latency of access
Data Volume   | Moderate size; archived after a certain period | Huge volume of data, stored mostly for a long time or forever; linearly scalable DB
Data Sources  | Data arrives from one or a few, mostly predictable sources | Data arrives from multiple locations and is of an unpredictable nature
Data Type     | Data are mostly structured                     | Structured or unstructured data
Data Access   | Primary concern is reading the data            | Concern is both read and write
Technology    | Standardized relational schemas; SQL language  | Many designs, with many implementations of data structures and access languages
Cost          | Expensive; commercial                          | Low; open-source software


Types of NoSQL Databases
The variety of big data means that file sizes and types will vary enormously. There are specialized databases to suit different purposes.

1. Document Databases: Storing a 10 GB video movie file as a single object could be speeded up by sequentially storing the data in contiguous blocks of physical storage. An index could store the identifying information about the movie, and the address of the starting block. The rest of the storage details could be handled by the system. This storage format is called a document store format. The index would contain the name of the movie, and the value is the entire video file, characterized by the first block of storage. Document databases are generally useful for content management systems, blogging platforms, web analytics, real-time analytics, and e-commerce applications. We would avoid using document databases for systems that need complex transactions spanning multiple operations, or queries against varying aggregate structures.

2. Key-Value Pair Databases: There could be a collection of many data elements, such as a collection of text messages, which could also fit into a single physical block of storage. Each text message is a unique object. This data would need to be queried often. That collection of messages could be stored in a key-value pair format, by combining the identifier of the message and the content of the message (see the small illustration after this list). Key-value databases are useful for storing session information, user profiles, preferences, and shopping cart data. Key-value databases don't work so well when we need to query by non-key fields, or on multiple key fields at the same time.

3. Graph Databases: Geographic map data is stored as a set of relationships, or links, between points. Graph databases are very well suited to problem spaces where we have connected data, such as social networks, spatial data, routing information, and recommendation engines.

4. Columnar Databases: Some kinds of databases are needed to speed up oft-sought queries on very large data sets. Suppose there is an extremely large data warehouse of web log access data, which is rolled up by the number of web accesses per hour. This needs to be queried, or summarized, often, involving only some of the data fields from the database. Such a query could be speeded up by creating a database structure that includes only the relevant columns of the data set, along with the key identifying information. This is called a columnar database format, and is useful for content management systems, blogging platforms, maintaining counters, expiring usage, and heavy write volume such as log aggregation. Column family databases work well for systems where the query patterns have stabilized.
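As a small illustration of the key-value format, a single shopping cart entry might be stored under one key (the key name and JSON value here are hypothetical):

key:   cart:user42
value: {"items": [{"sku": "B0123", "qty": 2}], "updated": "2016-03-01"}

The database only ever stores, looks up, or deletes the whole value by its key; it does not interpret the contents of the value.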

The choice of NoSQL database depends on the system requirements. There are at least 200 implementations of NoSQL databases of these four types. Visit nosql-database.org for more.

Despite the name, a NoSQL database does not necessarily prohibit a structured query language. While some of the NoSQL systems are entirely non-relational, others just avoid some selected functionality of an RDBMS, such as fixed table schemas and join operations. For NoSQL systems, instead of using tables, the data can be organized in a key/value pair format, and then SQL can be used.

The first popular NoSQL database was HBase, which is a part of the Hadoop family. The most popular NoSQL database used today is Apache Cassandra, which was developed and owned by Facebook till it was released as open source in 2008. Other NoSQL database systems are SimpleDB, Google's BigTable, MemcacheDB, Oracle NoSQL, Voldemort, etc.


Architecture of NoSQL

Figure 6-1: NoSQL Databases Architecture

One of the key concepts underlying NoSQL databases is that database management has moved to a two-layer architecture, separating the concerns of data modeling and data storage. The data storage layer focuses on the task of high-performance, scalable data storage for the task at hand. The data management layer supports a variety of database formats, and allows for low-level access to that data through specialized languages that are more appropriate for the job, rather than being constrained by the standard SQL format.

NoSQL databases map the data into key/value pairs and save the data in the storage units. There is no storage of data in a centralized tabular form, so the database is highly scalable. The data could be of different forms, and coming from different sources, and it can all be stored in similar key/value pair formats.

There are a variety of NoSQL architectures. Some popular NoSQL databases like MongoDB are designed in a master/slave model, like many RDBMSs. But other popular NoSQL databases like Cassandra are designed in a master-less fashion where all the nodes in the cluster are the same. So, it is the architecture of the NoSQL database system that determines whether the benefits of a distributed and scalable system emerge, such as continuous availability, distributed access, high speed, and so on.

NoSQL databases provide developers a lot of options to choose from, and to fine-tune the system to their specific requirements. This requires understanding how the data is going to be consumed by the system: is it read-heavy vs. write-heavy, is there a need to query data with random query parameters, and will the system be able to handle inconsistent data?


CAP Theorem
Data is expected to be accurate and available. In a distributed environment, accuracy depends upon the consistency of data. A system is considered Consistent if all replicas of a data item contain the same value. The system is considered Available if the data is available at all points in time. It is also desirable for the data to be consistent and available even when a network failure renders the database partitioned into two or more islands. A system is considered Partition-tolerant if processing can continue in both partitions in the case of a network failure. In practice it is hard to achieve all three.

The choice between Consistency and Availability remains the unavoidable reality for distributed data stores. The CAP theorem states that in any distributed system one can choose only two out of the three (Consistency, Availability, and Partition tolerance). The third will be determined by those choices.

NoSQL databases can be tuned to suit one's choice of high consistency or availability. For example, for a NoSQL database, there are essentially three parameters:

- N = replication factor, i.e. the number of replicas created for each piece of data

- R = minimum number of nodes that should respond to a read request for it to be considered successful

- W = minimum number of nodes that should respond to a write request before it is considered successful.

Setting the values of R and W very high (R=N, and W=N) will make the system more consistent. However, it will be slow to report success, and thus Availability will be low. On the other end, setting R and W to be very low (such as R=1 and W=1) would make the cluster highly available, as even a single successful read (or write) would let the cluster report success. However, consistency of data on the cluster will be low, since many of the nodes may not yet have received the latest copy of the data.
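A quick worked example: with N=3 replicas, choosing R=2 and W=2 means every read quorum (2 nodes) must overlap every write quorum (2 nodes), since 2 + 2 > 3, so a read always sees the latest successful write. Choosing R=1 and W=1 gives 1 + 1 ≤ 3, so a read may land on the one replica that has not yet received the latest write. In general, setting R + W > N trades some availability and latency for stronger consistency.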

If a network gets partitioned because of a network failure, then one has to trade off availability versus consistency. NoSQL database users often choose availability and partition tolerance over strong consistency. They argue that short periods of application misbehavior are less problematic than short periods of unavailability.

Consistency is more expensive, in terms of throughput or latency, than Availability. However, HDFS chooses consistency – as three failed datanodes can potentially render a file's blocks completely unavailable.


Popular NoSQL Databases
We cover two of the more popular offerings.


HBase
Apache HBase is a column-oriented, non-relational, distributed database system that runs on top of HDFS. An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key; all access to HBase tables is done using the Primary Key. An HBase column represents an attribute of an object. For example, if the table is storing diagnostic logs from web servers, each row will be a log record. Each column in that table will represent an attribute such as the date/time of the record, or the server name. HBase permits many attributes to be grouped together into a column family, so that all elements of a column family are stored together as essentially a composite attribute.

Columnar databases are different from a relational database in terms of how the data is stored. In a relational database, all the columns/attributes of a given row are stored together. With HBase you must predefine the table schema and specify the column families. All rows of a column family will be stored sequentially. However, it is very flexible in that new columns can be added to families at any time, making the schema flexible and therefore able to adapt to changing application requirements.
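To make the table, row key, and column family ideas concrete, here is a minimal sketch using the standard HBase Java client API. The table name (weblogs), the column family (log), and the row key and values are hypothetical, and the table is assumed to already exist (created, for example, from the HBase shell).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WebLogHBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("weblogs"))) {   // hypothetical table

      // Write one log record: row key = timestamp + server, columns grouped in the 'log' family
      Put put = new Put(Bytes.toBytes("20160308-1015-web01"));
      put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("server"), Bytes.toBytes("web01"));
      put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
      table.put(put);

      // Read the record back by the same row key (all HBase access is by key)
      Result result = table.get(new Get(Bytes.toBytes("20160308-1015-web01")));
      byte[] url = result.getValue(Bytes.toBytes("log"), Bytes.toBytes("url"));
      System.out.println("url = " + Bytes.toString(url));
    }
  }
}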

Architecture Overview

HBase is built on a master-slave concept. In HBase a master node manages the cluster, while the worker nodes (called region servers) store portions of the tables and perform the work on the data. HBase is modeled after Google's Bigtable, and offers similar capabilities on top of Hadoop and HDFS. It does consistent reads and writes. It does automatic and configurable sharding of tables. A shard is a segment of the database.

Figure 6-2: HBase Architecture

Physically, HBase is composed of three types of servers in a master-slave type of architecture.


(a) The NameNode maintains metadata information for all the physical data blocks that comprise the files.

(b) Region servers serve data for reads and writes.

(c) The Hadoop DataNode stores the data that the RegionServer is managing.

HBase tables are divided horizontally by row key range into "Regions." A region contains all rows in the table between the region's start key and end key. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, which is part of the Hadoop ecosystem, maintains a live cluster state. There is automatic failover support between RegionServers. All HBase data is stored in HDFS files. RegionServers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the RegionServers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.

Each RegionServer creates an ephemeral node in ZooKeeper. The HMaster monitors these nodes to discover available region servers, and it also monitors these nodes for server failures.

A master is responsible for coordinating the region servers, including assigning regions on startup, load balancing and recovery among regions, and monitoring their health. It is also the interface for creating, deleting, and updating tables.

Reading and Writing Data

There is a special HBase catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.

This is what happens the first time a client reads or writes to HBase:

1. The client gets the Region server that hosts the META table from ZooKeeper.

2. The client then queries the .META. server to get the region server corresponding to the row key it wants to access. The client caches this information along with the META table location.

3. It gets the Row from the corresponding RegionServer.

4. For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it re-queries and updates the cache.


Cassandra
Apache Cassandra is a highly scalable open source non-relational database that offers continuous uptime, simplicity, and easy data distribution across multiple data centers and the cloud. Cassandra was originally developed at Facebook and was open sourced in 2008. It provides many benefits over traditional relational databases for modern online applications, such as a scalable architecture, continuous availability, high data protection, data replication across multiple data centers, data compression, a SQL-like language, and so on.

Architecture Overview

Cassandra's architecture provides its ability to scale and to provide continuous availability. Rather than using a master-slave architecture, it has a master-less "ring" design that is easy to set up and maintain. In Cassandra, all nodes play an equal role; all nodes communicate with one another through a distributed and highly scalable protocol called gossip.

So, the Cassandra scalable architecture provides the capacity to handle large volumes of data, and large numbers of concurrent users or operations occurring at the same time, across multiple data centers, just as easily as a normal operation for relational databases. To enhance its capacity, one simply needs to add new nodes to an existing cluster, without taking down the system or redesigning from scratch.

Also, the Cassandra architecture means that, unlike other master-slave systems, it has no single point of failure and thus is capable of offering continuous availability and uptime.

Reading and Writing Data


Data to be written to a Cassandra node is first recorded in an on-disk commit log, and then it is written to a memory-based structure called a "memtable". When a memtable's size exceeds a certain set threshold, the data is written to a file on disk called an "SSTable". Thus the write operation is fully sequential in nature, with many input/output operations occurring at the same time, rather than occurring one at a time over a long period.

For a read operation, Cassandra looks in an in-memory data structure called a "Bloom filter" that gives the probability of an SSTable having the required data. The Bloom filter can tell very quickly whether a file probably has the needed data or not. If it returns true, then Cassandra looks in another layer of in-memory caches, and then fetches the compressed data from disk. If the answer is false, Cassandra doesn't bother reading that SSTable and looks in another file for the required data.

Write Syntax (using the low-level Thrift client API; HOST, PORT, userIDKey, cp, clock, UTF8, and CL are assumed to be defined elsewhere):

TTransport tr = new TSocket(HOST, PORT);
TFramedTransport tf = new TFramedTransport(tr);
TProtocol protocol = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(protocol);
tf.open();
client.insert(userIDKey, cp,
    new Column("column-name".getBytes(UTF8), "column-data".getBytes(), clock), CL);

Read Syntax:

Column col = client.get(userIDKey, colPathName, CL).getColumn();
LOG.debug("Column name: " + new String(col.name, UTF8));
LOG.debug("Column value: " + new String(col.value, UTF8));


Hive Language
Hive is a declarative SQL-like language for queries. Hive was designed to appeal to a community comfortable with SQL. It is used mainly by data analysts on the server side, for designing reports. It has its own metadata store, which can be defined ahead of time, before data is loaded. Hive supports map and reduce transform scripts in the language of the user's choice, which can be embedded within SQL clauses. It is widely used in Facebook by analysts comfortable with SQL, as well as by data miners programming in Python. Hive is best used for traditional data warehousing tasks; it is not designed for online transaction processing.

Hive is best suited for structured data. Hive can be used to query data stored in HBase, which is a key-value store. Hive's SQL-like structure makes transformation of data to and from an RDBMS easier. Supporting SQL syntax also makes it easy to integrate with existing BI tools. Hive needs the data to be first imported (or loaded), and after that it can be worked upon. In the case of streaming data, one would have to keep filling buckets (or files), and then Hive can be used to process each filled bucket, while using other buckets to keep storing the newly arriving data.

Hive tables and columns are mapped to directories and files in HDFS. This mapping is stored in the metadata store. All HQL queries are converted to MapReduce jobs. A table can have one or more partition keys. There are the usual SQL data types, plus Arrays, Maps, and Structs to represent more complex types of data. There are also user-defined functions for mapping and aggregating data.

Figure 6-3: Hive Architecture

Hive Language Capabilities

Hive's SQL provides almost all basic SQL operations. These operations work on tables and/or partitions. These operations are: SELECT, FROM, WHERE, JOIN, GROUP BY, and ORDER BY. It also allows the results to be stored in another table, or in an HDFS file.

The statement to create a page_view table would look like this:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Here is a script for creating an external staging table that is used for loading data:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

The table created above can be stored in HDFS as a TextFile or as a SequenceFile.

Loading a file into the staging location and inserting it into the partitioned table looks like this:

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';


Pig Language
Pig is a high-level procedural language. It is used mainly for programming. It helps to create a step-by-step flow of data processing. It operates mostly on the client side of the cluster. Pig Latin follows a procedural programming model and is more natural to use for building a data pipeline, such as an ETL job. It gives full control over how the data flows through the pipeline, and when to checkpoint the data in the pipeline; it supports DAGs in the pipeline, such as splits, and gives more control over optimization. Pig works well with unstructured data. For complex operations, such as analyzing matrices or searching for patterns in unstructured data, Pig gives greater control and options.

Pig allows one to load data and user code at any point in the pipeline. This can be important for ingesting streaming data from satellites or instruments. Pig also uses lazy evaluation. Pig is faster in the data import but slower in actual execution than an RDBMS-friendly language like Hive. Pig is well suited to parallelization, and so it is better suited for very large data sets where throughput (amount of data processed) is more important than latency (speed of response).

Pig is SQL-like, but differs to a great extent. It does not have a dedicated metadata store; the schema has to be defined in the program itself. Pig can be easier for someone who has no earlier experience with SQL.


Conclusion
NoSQL databases emerged in response to the limitations of relational databases in handling the sheer volume, nature, and growth of data. NoSQL databases work closely with MapReduce-style processing. NoSQL databases are proving to be a viable solution to enterprise data needs, and will continue to do so. There are four types of NoSQL databases: columnar, key-value, document, and graph databases. Cassandra and HBase are among the most popular NoSQL databases. Hive is an SQL-like language to access data from NoSQL databases. Pig is a procedural high-level language that gives greater control over data flows.


Review Questions
Q1: What is a NoSQL database? What are the different types of NoSQL databases?

Q2: How does a NoSQL database leverage the power of MapReduce?

Q3: What are the kinds of NoSQL databases? What are the advantages of each?

Q4: What are the similarities and differences between Hive and Pig?


Chapter 7 – Stream Processing with Spark
A stream processing system is a clever way to process large quantities of data from a vast set of extremely fast incoming data streams. The ideal stream processing engine will capture and report in real time the essence of all data streams, no matter their speed or size. This is achieved by using innovative algorithms and filters that relax many computational accuracy requirements, to compute simple approximate metrics in real time. A stream processing engine aligns with the infinite dynamism of the flow of nature.


Introduction
Apache Spark is an integrated, fast, in-memory, general-purpose engine for large-scale data processing. Spark is ideal for iterative and interactive processing tasks on large data sets and streams. Spark achieves 10-100x performance over Hadoop by operating with an in-memory construct called 'Resilient Distributed Datasets', which helps avoid the latencies involved in disk reads and writes. While Spark is compatible with Hadoop file systems and tools, a large-scale adoption of Spark and its built-in libraries (for machine learning, graph processing, stream processing, and SQL) delivers seamless fast data processing along with high programmer productivity. Spark has become a more efficient and productive alternative to the Hadoop MapReduce ecosystem, and is increasingly being used in industry.

Apache Spark was originally developed in 2009 in UC Berkeley's AMPLab, and open sourced in 2010; it later became an Apache project. It can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), and NoSQL databases such as HBase and Cassandra. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory. Spark gives us a comprehensive, unified framework to manage big data processing requirements, with a variety of data sets that are diverse in nature (text data, graph data, etc.) as well as in the source of data (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory, and 10 times faster even when running on disk. Spark is an alternative to Hadoop MapReduce rather than a replacement for Hadoop. It provides a comprehensive and unified solution to manage different big data use cases and requirements.


Spark Architecture

The core Spark engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data, including a SQL query engine, a library of machine learning algorithms, a graph processing system, and streaming data processing software. Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data. Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It provides support for deploying Spark applications in an existing Hadoop v1 cluster (with SIMR – Spark Inside MapReduce), a Hadoop v2 YARN cluster, or even Apache Mesos.

Next we will introduce two important features in Spark: RDDs and the DAG.

Resilient Distributed Datasets (RDD)

RDDs, Resilient Distributed Datasets, are a distributed memory abstraction. They are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude.

RDDs are immutable, partitioned collections of records, which can only be created by coarse-grained operations such as map, filter, groupBy, etc. By coarse-grained operations, it is meant that the operations are applied to all elements in a data set. RDDs can only be created by reading data from stable storage such as HDFS, or by transformations on existing RDDs.

Once data is read into an RDD object in Spark, a variety of operations can be performed by calling abstract Spark APIs. The two major types of operation available are transformations and actions. Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union(). Actions return a value based on some computation being performed on an RDD. Some examples of actions supported by the Spark API include reduce(), count(), first(), and foreach().
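A minimal sketch of this transformation-then-action pattern, using Spark's Java API; the input path and the "ERROR" filter condition are hypothetical:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("RddExample").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");            // hypothetical input path
    JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));  // transformation: lazy, returns a new RDD
    long numErrors = errors.count();                                        // action: triggers the actual computation
    System.out.println("Error lines: " + numErrors);

    sc.stop();
  }
}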

Directed Acyclic Graph (DAG)

DAG refers to a directed acyclic graph. This approach is an important feature for real-time Big Data platforms. These tools, including Storm, Spark, and Tez, offer amazing new capabilities for building highly interactive, real-time computing systems to power real-time BI, predictive analytics, real-time marketing, and other critical systems.

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling, i.e. after an RDD action has been called, it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution. In general, DAGScheduler does three things in Spark: it computes an execution DAG, i.e. a DAG of stages, for a job; it determines the preferred locations to run each task on; and it handles failures due to shuffle output files being lost.


Spark Ecosystem
Spark is an integrated stack of tools responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Spark is written primarily in Scala, and provides APIs in Python, Java, R, and other languages. Spark comes with a set of integrated tools that reduce learning time and deliver higher user productivity. The Spark ecosystem includes the Mesos resource manager, and other tools.

Spark has already overtaken Hadoop in popularity because of the benefits it provides in terms of faster execution of iterative processing algorithms.


Spark for Big Data Processing
Spark supports big data mining through relevant libraries, including MLlib, GraphX, and SparkR, and through the Spark SQL language and the Spark Streaming library.

MLlib

MLlib is Spark's machine learning library. It consists of basic machine learning algorithms such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. At the same time, MLlib cares about algorithmic performance. Spark excels at iterative computation, enabling MLlib to run fast. So MLlib also contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce. In addition, Spark MLlib is easy to use, and it supports Scala, Java, Python, and R (through SparkR).

For example, decision trees are a popular data classification technique. Spark MLlib supports decision trees for binary and multiclass classification, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.

Functions in Decision Trees

Method signature: public static DecisionTreeModel trainClassifier(…)

Method to train a decision tree model for binary or multiclass classification.

Parameters:

• input - Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}.

• numClassesForClassification - number of classes for classification.

• categoricalFeaturesInfo - Map storing the arity of categorical features.

• impurity - Criterion used for information gain calculation. Supported values: "gini" or "entropy".

• maxDepth - Maximum depth of the tree (suggested value: 4).

• maxBins - Maximum number of bins used for splitting features (suggested value: 100).

Returns: DecisionTreeModel that can be used for prediction.
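A minimal usage sketch in Java, assuming the MLlib decision tree API described above; the input file path is hypothetical, and the training data is loaded in LibSVM format with MLUtils:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.util.MLUtils;

public class DecisionTreeExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("DecisionTreeExample").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load labeled training data in LibSVM format (path is hypothetical)
    JavaRDD<LabeledPoint> trainingData =
        MLUtils.loadLibSVMFile(sc.sc(), "data/sample_libsvm_data.txt").toJavaRDD();

    Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();  // empty: all features continuous
    DecisionTreeModel model = DecisionTree.trainClassifier(
        trainingData, 2, categoricalFeaturesInfo, "gini", 4, 100);    // numClasses=2, maxDepth=4, maxBins=100

    // Predict the label of the first training example and compare with the actual label
    LabeledPoint first = trainingData.first();
    System.out.println("predicted = " + model.predict(first.features())
        + ", actual = " + first.label());

    sc.stop();
  }
}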

Spark GraphX

Efficient processing of large graphs is another important and challenging issue. Many practical computing problems concern large graphs. For example, Google has to run its PageRank algorithm on billions of web pages and maybe trillions of web links. GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multi-graph with properties attached to each vertex and edge.

To support graph computation, GraphX exposes a set of fundamental operators such as subgraph, joinVertices, and aggregateMessages, on the basis of an optimized variant of the Pregel API (Pregel is the system at Google that powers PageRank). In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

We can compute the PageRank of each page as follows:

// Load the edges as a graph object
val graph = GraphLoader.edgeListFile(sc, "outlink.txt")

// Run PageRank
val ranks = graph.pageRank(0.00000001).vertices

// Join the ranks with the web pages
val pages = sc.textFile("pages.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}

val ranksByPagename = pages.join(ranks).map {
  case (id, (pagename, rank)) => (pagename, rank)
}

// Print the output
println(ranksByPagename.collect().mkString("\n"))

SparkR

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the runtime is single-threaded and can only process data sets that fit in a single machine's memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark; by using Spark's distributed computation engine it allows us to run large-scale data analysis from the R shell. SparkR exposes the RDD API of Spark as distributed lists in R. For example, one can read an input file from HDFS and process every line using lapply on an RDD. A small example follows:

sc <- sparkR.init("local")

lines <- textFile(sc, "hdfs://data.txt")

wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })

In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyWithPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey, and collect.

Spark SQL

Spark SQL is a language provided to deal with structured data. Using this, one can run queries on the data and get meaningful results. It supports queries through SQL as well as HQL (Hive Query Language), which is Apache Hive's version of SQL.

Spark Streaming

Spark Streaming receives data streams from input sources, processes them in a cluster, and pushes the results out to databases or dashboards. Spark chops up the data streams into batches of a few seconds. Spark treats each batch of data as an RDD and processes it using RDD operations. The processed results are pushed out in batches.
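A minimal sketch of this micro-batch model, using the Java Streaming API; the socket source on localhost port 9999 and the "ERROR" filter are hypothetical:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingErrorCount {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("StreamingErrorCount").setMaster("local[2]");
    // Incoming data is chopped into 5-second micro-batches; each batch is processed as an RDD
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    // Hypothetical source: lines of text arriving on a local TCP socket
    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));  // per-batch transformation
    errors.count().print();        // output operation: print the per-batch count of error lines

    jssc.start();                  // start receiving and processing
    jssc.awaitTermination();
  }
}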


Spark Applications
Some hot data problems that are solved well by a tool like Apache Spark include: 1. Real-time log data monitoring. 2. Massive natural language processing. 3. Large-scale online recommendation systems.

A simple word count application can be run in the Spark shell as below.

val textFile = sc.textFile("C:/Users/MyName/Documents/obamaSpeech.txt")

*** Comment: reads the text file into an RDD named textFile ***

val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

*** Comment: calculates the count of each word by splitting on spaces ***

counts.count()

*** Returns the output as below ***

Long = 52

counts.saveAsTextFile("C:/Users/MyName/Desktop/counts1")

*** Comment: saves the output file on my Desktop ***


Spark vs Hadoop
Spark and Hadoop are both popular Apache projects dedicated to big data processing. Hadoop, for many years, was the leading open source big data platform, and many companies already use a distributed computing framework like Hadoop based on MapReduce. The table below provides a summary of the differences between Hadoop and Spark.

Feature: Hadoop vs. Spark

Purpose. Hadoop: resilient, cost-effective storage and processing of large data sets. Spark: fast general-purpose engine for large-scale data processing.

Core component. Hadoop: Hadoop Distributed File System (HDFS). Spark: Spark Core, the in-memory processing engine.

Storage. Hadoop: HDFS manages massive data collections across multiple nodes within a cluster of commodity servers. Spark: Spark does not do distributed storage; it operates on distributed data collections.

Fault tolerance. Hadoop: uses replication to achieve fault tolerance. Spark: uses RDDs for fault tolerance, which minimizes network I/O.

Nature of processing. Hadoop: accompanied by MapReduce, it does batch processing of data in parallel mode. Spark: batch as well as stream processing.

Sweet spot. Hadoop: batch processing. Spark: iterative and interactive processing jobs that can fit in memory.

Processing speed. Hadoop: MapReduce is slow. Spark: can be up to 10x faster than MapReduce for batch processing and up to 100x faster for stream processing.

Security. Hadoop: more secure. Spark: less secure.

Failure recovery. Hadoop: can recover from system faults or failures, since data is written to disk after every operation. Spark: data objects are stored in RDDs, which can be reconstructed after faults or failures.

Analytics tools. Hadoop: separate engines. Spark: built-in MLlib (machine learning) and GraphX (graph processing) libraries.

Compatibility. Hadoop: primary storage model is HDFS. Spark: compatible with HDFS and other storage formats.

Language support. Hadoop: Java. Spark: Scala is the native language; APIs for Python, Java, R, and others.

Driving organization. Hadoop: Yahoo. Spark: AMPLab at UC Berkeley.

Technology owners. Hadoop: Apache, open-source, free. Spark: open-source, free.

Key distributors. Hadoop: Cloudera, Hortonworks, MapR. Spark: Databricks, AMPLab.

Cost of system. Hadoop: medium to high. Spark: medium to high.


Conclusion
Spark is a new integrated system for big data processing. Its most important core abstraction is the RDD, along with relevant libraries like MLlib and GraphX. Spark is a really powerful open source processing engine built around speed, ease of use, and sophisticated analytics.


Review Questions
Q1: Describe the Spark ecosystem.

Q2: Compare Spark and Hadoop in terms of their ability to do stream computing.

Q3: What is an RDD? How does it make Spark faster?

Q4: Describe three major capabilities in Spark for data analytics.


Chapter 8 – Ingesting Data Wholeness
A data ingesting system is a reliable and efficient point of reception for all data coming into a system. This system is designed to be flexible and scalable, to receive data from various sources, at various times, speeds, and quantities. The ingest system makes the data available for use by the target applications in real time. Ideally, all data would be smoothly received, and made available for downstream applications to securely and reliably access at their own convenience. A dedicated data ingest mechanism is achieved by creating a fast and flexible buffer for receiving and storing all incoming streams of data. The data in the buffer is stored in a sequential manner, and is made available to all consuming applications in a fast and orderly manner.

Big data arrives into a system at unpredictable speeds and quantities. Business applications thereafter receive and process this data at some planned throughput capacity. An ingest buffer is needed to communicate the data without loss of data or speed. This buffer idea has historically been called a messaging system, not too dissimilar from a mailbox system at the post office. Incoming messages are put into a set of organized locations, from where the target applications will receive them when they are ready.

With huge amounts of data coming in from different sources, and many more consuming applications, a point-to-point system of delivering messages becomes inadequate and slow. Alternatively, incoming data can be categorized into certain topics, and stored in the respective location or locations for those topics. Instead of data being received and held in storage for a specific target application, now the data may be consumed by any application that is interested in data related to a topic. Each consuming application can choose to read data about one or more topics of its interest. This is called the publish-and-subscribe system.


Messaging Systems
A messaging system is an asynchronous mode of communicating data between applications. There are two generic kinds of messaging systems − a point-to-point system, and a publish-subscribe (pub-sub) system. Most messaging patterns now follow the pub-sub model.

Point-to-Point Messaging System

In a point-to-point system, every message is directed at a particular receiver. A common queue can receive messages from many producers. Any particular message can be received and consumed by only one receiver. Once that target consumer reads a message in the queue, that message disappears from that queue. The typical example of this system is an Order Processing System, where each order will be processed by one Order Processor.

Publish-Subscribe Messaging System

In a pub-sub messaging system, the applications publish their output to a standard messaging queue. The target recipient only needs to know where to get the message, whenever it is ready to pick up the message. Applications thus can ignore the mechanics of interaction with other applications, and simply care about the message itself. This is especially valuable when there may be many target recipients for a message. In a pub-sub system, messages are entered into the messaging queue asynchronously from client applications.

A message queuing system needs to be fast and secure to serve many applications, both producers and subscribers. Messages are also replicated across multiple locations for reliability of data.

There are two popular data ingesting systems used in Big Data. An older system, called Flume, is closely tied to the Hadoop distributed file system. The newer and more popular system is a general-purpose system called Apache Kafka. In this chapter we will discuss the newer system, Kafka.


Apache Kafka
Apache Kafka is an open source publish-and-subscribe message broker system. Kafka aims to provide an integrated, high-throughput, low-latency messaging platform for handling real-time data feeds. In the abstract, it is a single point of contact between all producers and consumers of data. All producers of data send data to Kafka. All consumers of data read data from Kafka. (Figure 8.1)

Figure 8-1: Kafka core idea

Kafka is a distributed, partitioned, scalable, replicated messaging system, with a simple but unique design. It was initially developed by LinkedIn and was open sourced in early 2011. The Apache Software Foundation is now responsible for its development and improvement. Kafka is valuable for enterprise-level infrastructure because of its simplicity and scalability. The Kafka system is written in the high-level Scala programming language.


Use Cases
Following are some popular use cases of Apache Kafka.

Messaging

Kafka is a very good alternative to a traditional message broker because the Kafka messaging system has better throughput, built-in partitioning, replication, and better fault tolerance. Kafka is a very good solution for large-scale message processing applications.

Website Activity Tracking

Website activity tracking was one of the initial use cases for Kafka at LinkedIn. Users' online activity tracking pipeline was rebuilt as a set of real-time data feeds. General web activity tracking involves a very large volume of data, and Kafka is very good at handling this huge volume of data. User activity types such as page views, searches, clicks, etc. can be designated as central topics, and the activity data can be published to those topics. Those events are then available for real-time or offline processing and reporting.

Stream Processing

Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and send it to other users and consumer applications. They may even write it back to Kafka to a new topic. Kafka's strong durability is also very useful for stream processing.

Log Aggregation

Activity log aggregation typically gathers physical log files from servers and puts them all in a central place for processing. Kafka can abstract away the details of the files and provide a cleaner abstraction of log data as a stream of messages. Use of Kafka then allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. Unlike dedicated log-centric systems, Kafka offers higher performance and stronger durability guarantees due to replication.

Commit Log

Kafka can be used as an external commit log for a distributed database system. This audit log can help to re-sync data between failed nodes to restore their data. The log compaction feature in Kafka helps to achieve this more efficiently.


Kafka Architecture
In the abstract, Kafka brokers deal with producers and consumers of data. A producer pushes data into the ingest system at its own speed, scale, and convenience. A consumer pulls data out of the system at its own speed, scale, and convenience. All the received data is organized by categories, called topics. Incoming data is sorted and stored into topic servers. The consumers of data can subscribe to one or more topics (Figure 8.2).

Figure 8-2: Kafka Ecosystem

There is more than one broker (also called servers, or partitions) for each topic, for reliability of the messaging system. Thus two or more brokers will store the data on each topic. Only one broker can be the leader at any given time. If the lead broker fails, a second one can automatically take over and prevent the loss of access to data.

Kafka is designed for distributed high-throughput systems. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message processing applications. It has the ability to handle a large number of diverse consumers. It integrates very well with Apache Storm, Spark, and other real-time streaming data applications. Kafka is very fast and can perform 2 million writes per second. It also guarantees zero downtime and zero data loss.

There are a lot of contributing organizations helping to improve the Kafka open-source system. It has very well-documented online resources. It has been used by many big organizations such as LinkedIn, Cisco Systems, Spotify, PayPal, HubSpot, Shopify, Uber, and more. HubSpot uses Kafka to deliver real-time notification of when a recipient opens their email. PayPal uses Kafka to process millions of updates in a minute.

Producers

A producer is responsible for selecting the partition, and the topic, for the message that it wants to convey. It can use a round-robin algorithm to balance the load among partitions. There can be both synchronous and asynchronous producers for producing messages and publishing them to the partition.

Consumers

A consumer is responsible for reading the data on the topics that it has subscribed to. The consumer is responsible for reading the data within a reasonable period of time, before the queues are emptied for efficient management of storage. Different consuming applications can read the data at different times. Kafka has stronger ordering guarantees than a traditional messaging system. A consumer needs to know how far it has read in the queue, so as to avoid duplicates or losing some data.

Broker

A broker is a server in a Kafka cluster. The cluster may have many such servers or brokers.

Topic

A topic is a category into which messages are published. For each topic there is a separate partition log for the storage of messages. Each partition has an ordered sequence of messages for that topic. Each message in the partition is assigned a unique sequential number, also called the offset. This offset helps to identify each message within the partition.

The consumer reads the data sequentially according to the offset numbers. The consumer maintains the offset to remember how far it has read. Generally, the offset increases linearly as messages are consumed. However, a consumer can reset the offset to access the data again and reprocess it as needed.

The Kafka cluster keeps all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the log retention is set to seven days, then for the seven days after publishing, the message is available for consumption. After seven days, Kafka discards the message to free up space.

Kafka's performance is not affected by the size of the data. Each partition must fit on the servers that host it, but a topic may have multiple partitions. This enables Kafka to manage an arbitrary amount of data. A partition also acts as the unit of parallelism.

Summary of Key Attributes

1. Disk based: Kafka works on a cluster of disks. It does not keep everything in memory, and keeps writing to the disk to make the storage permanent.

2. Fault tolerant: Data in Kafka is replicated across multiple brokers. When any leader broker fails, a follower broker takes over as leader and everything continues to work normally.

3. Scalable: Kafka can scale up easily by adding more partitions or more brokers. More brokers help to spread the load, and this provides greater throughput.

4. Low latency: Kafka does very little processing on the data. Thus it has a very low latency rate: messages produced by the producer are published and available to the consumer within a few milliseconds.

5. Finite retention: Kafka by default keeps the messages in the cluster for a week. After that the storage is refreshed. Thus the data consumers have up to a week to catch up on data, in case they fall behind for any reason.

Distribution

The Kafka cluster maintains multiple servers over the distributed network. The partitions of the log are maintained over this network. Each server handles data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. One of the servers for each partition acts as the main server, also called the "leader", while there may be one or more secondary servers, known as "followers". The leader server is responsible for handling all the read and write operations for the partition, while the followers silently replicate the leader. The follower servers become very helpful when the leader server fails: a follower automatically becomes the leader and handles the failure. One server can be a leader for some of the partitions on it, while being a follower for other partitions. Thus one server can act as both leader and follower. This helps to balance the workload on the servers within the cluster.

Guarantees

Messages always maintain the order in which they were sent. For example, if messages M1 and M2 were sent by the same producer and M1 was sent first, then message M1 will have a lower offset than message M2. Therefore, M1 will always appear before M2 for the consumer.

Each topic has a replication factor N, and the system can tolerate up to N-1 server failures without losing any messages committed to the log.

Client Libraries

Kafka supports the following client libraries:

1. Python: Pure Python implementation with full protocol support; Consumer and Producer are also included.

2. C: High-performance C library with full protocol support.

3. C++, Ruby, JavaScript, and more.


Apache ZooKeeper
Kafka is built on top of ZooKeeper. Apache ZooKeeper is a distributed configuration and synchronization service used in Hadoop clusters. Here it serves as the coordination interface between the Kafka brokers and consumers. The Kafka servers store basic metadata in ZooKeeper and share information about topics, brokers, consumer offsets (queue readers), and so on.

Since ZooKeeper does its own layers of replication, the failure of a Kafka broker does not affect the state of the Kafka cluster. Even if ZooKeeper fails, Kafka will restore the state once ZooKeeper restarts. This gives zero downtime for Kafka. ZooKeeper also manages the alternative leader broker selection, in case of a Kafka leader failure.

Kafka Producer example in Java

// Configure the producer (serializer settings are required; byte[] keys and values are used here)
Properties config = new Properties();
config.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:8082");
config.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArraySerializer");
config.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArraySerializer");

KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(config);

// Publish one record to the "topic" topic
ProducerRecord<byte[], byte[]> record = new ProducerRecord<>("topic", "key".getBytes(), "value".getBytes());
Future<RecordMetadata> response = producer.send(record);
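On the receiving side, a Kafka Consumer example in Java might look like the following minimal sketch; the broker address, consumer group id, and topic name are hypothetical, and the classes come from the org.apache.kafka.clients.consumer package (plus java.util.Properties and java.util.Arrays):

Properties config = new Properties();
config.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:8082");
config.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
config.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
config.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(config);
consumer.subscribe(Arrays.asList("topic"));      // subscribe to one or more topics

while (true) {
    // Poll the brokers for the next batch of records on the subscribed topics
    ConsumerRecords<byte[], byte[]> records = consumer.poll(100);
    for (ConsumerRecord<byte[], byte[]> record : records) {
        System.out.println("offset = " + record.offset()
            + ", value = " + new String(record.value()));
    }
}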


Conclusion
Big data is ingested using a dedicated system. These often take the form of messaging systems. Publish-and-subscribe systems are efficient ways of delivering data from many sources to many targets, in a reliable, secure, and efficient way. Kafka is an open-source, reliable, secure, and scalable publish-subscribe messaging system. It deals with producers as well as consumers of data. Messages are published to a set of central topics. Each consumer can subscribe to any number of topics. Kafka uses a leader-follower system of managing replicated partitions for the same set of data, to ensure full reliability and zero downtime.


Review Questions
Q1: What is a data ingest system? Why is it an important topic?

Q2: What are the two ways of delivering data from many sources to many targets?

Q3: What is Kafka? What are its advantages? Describe 3 use cases of Kafka.

Q4: What is a topic? How does it help with data ingest management?


References
1. http://kafka.apache.org/documentation.html#introduction


Chapter 9 – Cloud Computing Primer
Cloud computing is a cost-effective and flexible mode of delivering IT infrastructure as a service to clients, over the internet, on a metered basis. The cloud computing model offers clients enormous flexibility to use as much IT capacity – compute, storage, network – as needed, without having to invest in dedicated IT capacity of one's own. The IT usage can be scaled up or down in minutes. The complex IT infrastructure management skills are all owned by the cloud computing provider, and problems can be resolved much faster. The client can simply access a smoothly running IT infrastructure over a fast internet connection. IT capacity in the cloud can be purchased as a custom package, depending upon one's needs in terms of average and peak IT requirements. The computing cloud is the ultimate cosmic computer aligned with all laws of nature.


Introduction
Managing very large and fast data streams is a huge challenge. It requires making critical decisions about storage, structure, and access. This data would be stored in large clusters of hundreds or thousands of inexpensive computers. Such clusters are often called server farms. The location and size of such clusters impact costs. The server farms may be located in one's own data centers, or they may be rented from specialized third-party organizations called cloud computing service providers.

Cloud computing provides IT leadership a cost-effective and predictable solution for reliably meeting their large data management needs. There are many vendors offering this service. Prices keep dropping regularly, because IT components keep getting cheaper, there is a growing volume of business, and there is effective competition. With cloud computing, the IT expense becomes an operating expense rather than a capital expense. The cost of IT becomes aligned with revenue streams and makes cash flow management easier.

One of the main reasons for enterprises moving to cloud computing is to experiment with new and risky projects. This flexible model makes it much easier to launch new products and services, without being exposed to the risk of a heavy loss on IT infrastructure. For example, a new Hollywood movie's site will have millions of visitors to its website for a month before and a month after the movie's release date. After that, the visits to the website will drop dramatically. The website owner would benefit enormously from using a cloud computing model where they pay for the peak web usage capacity for those few months, and much less as the usage drops down. More importantly, the flexibility ensures that their website will not crash just in case the movie becomes a super-hit and attracts an unusually large number of visitors to the website.


Cloud Computing Characteristics
Here are the major characteristics of a cloud computing model.

1. Flexible Capacity: The capacity can scale up rapidly. One can expand and reduce resources according to one's specific service requirements, as and when needed. The cloud internally does regular workload balancing among the needs of millions of clients, and this helps bring down costs for everyone.

2. Attractive payment model: Cloud computing works on a pay-per-use model, i.e. one pays only for what one uses, and for how long one uses it. IT costs become an operating expense rather than a capital expense for the client. The resource prices may be negotiated at long-term contract rates, and can also be purchased at spot market rates.

3. Resiliency and Security: The failure of any individual server or storage resource does not impact the user. The servers and storage for all clients are isolated to maximize the security of data.


In-house Storage
Most organizations have data centers for running their regular IT operations. An organization may decide to expand its own data center to store large streams of data. The organization can ensure complete security and privacy of its data if it keeps all the data in-house. However, the costs and complexity of managing this data are increasing, and it is not cost-effective for every organization to manage huge data centers. Hiring and retaining the scarce advanced skills needed to manage such data centers would also be a challenge.


Cloud Storage
It is now becoming a trend for organizations to choose to store their data in massive data centers owned by other, specialized companies. Their data and processing capacity reside in some sort of a huge cloud out there, which is accessible from anywhere, anytime, through a simple internet connection.

Companies like Amazon, Google, Microsoft, Apple, and IBM are among the major providers of cloud storage and computing services around the world. They own and operate data centers with millions of computers in them.

Figure 9-1: A cloud computing data center

Commercially, cloud service providers are able to consolidate the requirements of thousands or millions of customers, and supply flexible amounts of data storage and computing facility to clients on a per-usage basis. This pay model is similar to how electric utility companies charge consumers for their usage of electricity in homes and offices. Cloud computing offers much lower costs per use, just like using the electric utility costs much less than owning and operating one's own electricity generators.


A major disadvantage of cloud storage is that the data is stored away from one's physical control. The security of precious data is thus left in the hands of the cloud computing provider. While security protocols are rapidly improving, there are no failsafe methods for securing data in the cloud. There is also a risk of being locked into one provider's infrastructure. Even so, the cost-benefit tradeoffs have definitely tilted towards using cloud computing providers. At some future point in time, cloud service providers might be heavily regulated, like the electric utilities.


Cloud Computing: Evolution of Virtualized Architecture

Cloud computing is essentially a commercial model for virtualized server infrastructure. IBM began to offer time-sharing services on its mainframe computers in the 1960s. Now the same idea is offered on networks of small machines through virtualization.

Virtualization separates logical machines from physical machines. A physical server can run multiple Virtual Machines (VMs), and one virtual machine may span multiple physical servers. The virtualization software is called a hypervisor. It abstracts all machines into virtual machines, typically managed through an easy GUI interface. Virtualization software can run on a heterogeneous physical infrastructure and convert all IT capacity into a single unified pool. This capacity can then be provisioned in slices and packages. User applications are not aware that they are running in a virtualized environment; they run as if on a dedicated machine. The applications can also run on top of their own native operating systems.


Cloud Service Models

There are two major dimensions along which to conceptualize cloud computing models: the scope of services received, and the control over and cost of those services.

1. The range of cloud computing services from a cloud computing provider falls into three broad buckets:

   1. Infrastructure as a Service (IaaS): This is the lowest level of service, and includes only raw capacity of compute, storage, and networking. The price for this service is the lowest.

   2. Platform as a Service (PaaS): This includes IaaS, along with other technologies and services. These are still fairly general tools, such as an open-source Hadoop, Spark, or Cassandra implementation, along with certain monitoring tools. The costs are a little higher because of the additional management and monitoring services provided by the provider.

   3. Software as a Service (SaaS): This includes the computing platform as well as the business applications that get work done. For example, salesforce.com was one of the first CRM applications sold only on a SaaS model. Google sells an email service to organizations on a per-user-per-month basis. This is also the most expensive type of cloud service.

2. The other way cloud services differ is in terms of ownership and control.

   1. Public cloud: This is a large shared infrastructure made available to one and all, in a low-cost, multi-tenancy model. The client can access it using any device. The downside is that the data also resides on the cloud, and thus could be vulnerable to theft or hacking. The costs to the client are low, and variable depending upon use.

   2. Private cloud: This is a cloud version of an in-house IT infrastructure. The organization has exclusive control over the entire infrastructure. The costs are fixed and higher.

   3. Hybrid cloud: This is a mix of flexible capacity and control over key aspects of the infrastructure. One can retain complete control over critical applications, while using shared infrastructure for non-critical applications.

All levels of infrastructure and pay models are useful, as they serve different levels of needs for client organizations. However, most of the growth in cloud computing is happening because of the attractiveness of the low cost of the public cloud model.


Cloud Computing Myths

There are a couple of common misconceptions about the costs and benefits of cloud computing.

1. Myth: Public cloud computing satisfies all the requirements: scalability, flexibility, pay per use, resilience, multi-tenancy, and security. In reality, depending upon the type of service selected (IaaS, PaaS, or SaaS), the service satisfies only specific subsets of these requirements.

2. Myth: Cloud computing is useful only if you are outsourcing your IT functions to an external service provider. In reality, one can use a private cloud computing model for a section of IT applications to offer on-demand, scalable, and pay-per-use deployments within the enterprise's own data center.


Cloud Computing: Getting Started

Here is a simple framework for cloud adoption. Learn about the context for obtaining benefits from cloud computing. Select the right model and level of cloud capacity. Set up the applications, and a monitoring system for those applications and the total cloud footprint. Choose a service provider, say Amazon Web Services (AWS), the leading provider of cloud computing. Use Appendix 1 to install Hadoop on AWS EC2 public cloud infrastructure.


Conclusion

Cloud computing is a business model for providing shared, flexible, cost-effective IT infrastructure to get started quickly on building an application. For Big Data applications, it can be especially attractive to test the system using rented facilities before deciding whether to invest in dedicated IT infrastructure.


Review Questions

Q1: Describe the cloud computing model.

Q2: What are the advantages of cloud computing over in-house computing?

Q3: Describe the technical architecture for cloud computing.

Q4: Name a few major providers of cloud computing services.


Section 3

This section covers the other relevant concepts and tutorials for effectively managing and utilizing Big Data.

Chapter 10 brings all the tools together in a case study of developing a web log analyzer, as an example of a useful Big Data application.

Chapter 11 covers an overall view of Data Mining tools and techniques to extract benefit from Big Data.

Appendix 1 shows, step by step, how to install a Hadoop cluster on a cloud computing platform.

Appendix 2 is a tutorial on installing and running Spark.


Chapter 10 – Web Log Analyzer Application Case Study

Introduction

A web log analyzer is an automated software tool that helps analyze, and make decisions on, a number of issues arising from web application server logs. An ideal web log analyzer would analyze unlimited streams of log data and help keep the entire server environment running smoothly and without fault. It does this by eliminating the need to manually access the logs, automating the flow of information, and alerting the system administrator as needed.


Client-Server Architecture

Every web-based application runs on a client-server architecture. Clients are entities that access servers, and servers are entities that respond to clients with a result. Many clients simultaneously try to access the servers. The servers may be database servers, network servers, application servers, or any server in the n-tier architecture. For each request, a log entry is generated. The rate of access requests determines the rate of the stream of log entries, which leads to a potentially huge log over time. The log can be processed as a stream of data, and it can also be stored on the servers for later analysis.

Logs can be used for monitoring, audit, and analysis purposes. They can help with error diagnostics in case a website becomes slow or goes down. Logs can be analyzed to detect hacking activity. They can also be analyzed to summarize the popularity of web pages and the distribution of the page requesters. They can help with measuring access volumes, and with scaling the infrastructure up or down.


Web Log Analyzer

The log analyzer receives streaming logs from a server location, and analyzes them using many algorithms to generate the desired results. The system is completely automated: the log is produced, and it is consumed to make real-time reports. It is easy to imagine the massive data flow produced by the log in the server environment while it is also being analyzed simultaneously on the administrator side.

Requirements

This is a log analyzer for a web application hosted on a server. It is a busy application owned by a big company, receiving more than 15,000 web access requests per hour. All the access requests need to be logged and dumped to the Hadoop file system periodically. The analyzer is required to ingest real-time log data, and filter out a part of the data for analysis before dumping it to HDFS. It has to do streaming data flow management as well as batch processing. The analyzer needs to process the data before it is dumped into HDFS, and also after it is put into HDFS. The system administrators should be alerted in real time about possible threats, overloads, delays, potential errors, and any other damage. The results of all the analyses need to be stored in a database for later presentation in a graphical format. The results have to be made available for any period of time, without any missing time values. The log data has to be preserved for the future without losing any log entries.

Solution Architecture

Get streaming data using Apache Flume, and send it to HDFS. Use Apache Spark as the data flow management platform and processing engine. Store the results of analysis in MongoDB. This is a safe solution, because the data gets stored in the Hadoop cluster and is available for future requirements, even while it is being analyzed in real time. The results of real-time processing also go into MongoDB.


Figure 10.1: Web Log Analyzer Architecture


Benefits of This Solution

The advantages of this solution are:

1. Real-time logging and analysis: data generated on the server is streamed directly to HDFS by the Flume agent without delay. Every log entry generated at every point in time is analyzed and used for monitoring and decision making.

2. Automatic log handling and storage: loading data into HDFS normally requires manually running certain Hadoop commands. This log analyzer uses a Flume agent or Spark Streaming to handle all data on its own, without any externally managed effort.

3. Easy and convenient implementation using built-in and easy-to-customize machine learning algorithms in Spark.

4. Easy error handling, server request handling, and overall server performance optimization. It makes servers smarter by keeping track of almost every aspect of the server.


Technology Stack

The technology stack used for this application is shown below. A brief description of each component follows.

1. Apache Spark v2
2. Hadoop 2.6.0 cdh5
3. Apache Flume
4. Scala, Java
5. MongoDB
6. RESTful web services
7. Front-end UI tools
8. Linux shell scripts


Apache Spark

Spark is a fast, in-memory cluster computing technology, designed for fast batch and streaming computation. It builds on the Hadoop ecosystem and extends the MapReduce model to more types of computation, including interactive queries and stream processing. It has many libraries and packages, such as machine learning (MLlib) and graph computation (GraphX). It claims to execute 10 to 100 times faster than Hadoop MapReduce because of its in-memory computation model. It also supports multiple languages such as Scala, Python, Java, and R.

Spark Deployment

1. Standalone
2. Hadoop YARN
3. SIMR: Spark in MapReduce
4. Mesos

Components of Spark

Spark SQL: A data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming: Ingests data in mini-batches and performs RDD transformations on those mini-batches, enabling streaming data analytics using RDDs.

MLlib (machine learning): A distributed machine learning framework, which operates in-memory at high speed, and offers many ML algorithms.

GraphX: A distributed graph processing framework that provides an API for many graph computation algorithms.

Spark Core: The general execution engine for the Spark platform upon which all other functionality is built. It takes care of task dispatching and scheduling, and basic I/O functionality.

Spark shell: A powerful tool to analyze data interactively, available in Scala and Python. Spark's primary data abstraction is an in-memory collection of items called an RDD (Resilient Distributed Dataset). RDDs can be created from Hadoop input formats such as files in HDFS, and by transforming existing RDDs with operations such as filters and maps. A short example is shown below.
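Here is a minimal sketch of an interactive spark-shell session in Scala. The HDFS path and the response-code string below are illustrative assumptions only, not part of the case study.

// Inside spark-shell, the SparkContext is already available as sc.
val logs = sc.textFile("hdfs:///user/ubuntu/access-logs/access.log")   // create an RDD from an HDFS file
val unauthorized = logs.filter(line => line.contains(" 401 "))         // transform into a new, filtered RDD
println("401 responses: " + unauthorized.count())                      // an action triggers the computation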

Scripting and programming model using SparkContext: One can use an IDE to develop and test the analytics code, and then create a jar to run the analytics on the Hadoop architecture. The jar can be submitted to the Spark engine using the spark-submit utility. For example:


spark-submit --class apache.accesslogs.ServerLogAnalyzer --master local[*] ScalaSpark/Scala1/target/scala-2.10/Scala1-assembly-1.0.jar > output.txt


HDFS

HDFS is the distributed file system at the core of the Hadoop system. Its key characteristics are:

- Deployed on low-cost commodity hardware

- Fault tolerant

- Supports batch processing

- Designed for large data sets and large files

- Maintains coherence through a write-once, read-many-times model

- Moves computation to the location of the data


MongoDB

MongoDB is a document-oriented database that stores data as flexible, JSON-like documents. It came into existence as a NoSQL database.


Apache Flume

Flume is an open-source tool for handling streaming logs or data. It is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store. It is a popular tool to assist with data flow into HDFS storage. Flume is not restricted to log data: the data sources are customizable, so the source may be event data, traffic data, social media data, or any other data source. The major components of Flume are:

- Event

- Agent

- Data generators

- Centralized stores

A minimal agent configuration sketch is shown below.
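As an illustration only (the agent name, the log file path, and the HDFS directory below are assumptions, not part of this case study), a single Flume agent that tails a web server log and writes the events into HDFS could be configured along these lines:

# Name the source, channel, and sink of one agent (agent name is illustrative)
weblog-agent.sources = tail-source
weblog-agent.channels = mem-channel
weblog-agent.sinks = hdfs-sink

# Data generator: tail the web server access log (path is an assumption)
weblog-agent.sources.tail-source.type = exec
weblog-agent.sources.tail-source.command = tail -F /var/log/apache2/access.log
weblog-agent.sources.tail-source.channels = mem-channel

# Buffer events in memory between the source and the sink
weblog-agent.channels.mem-channel.type = memory

# Centralized store: write the events into an HDFS directory
weblog-agent.sinks.hdfs-sink.type = hdfs
weblog-agent.sinks.hdfs-sink.hdfs.path = hdfs:///user/ubuntu/weblogs/
weblog-agent.sinks.hdfs-sink.channel = mem-channel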


Overall Application Logic

The system reads access logs and presents the results in tabular and graphical form to end users. The system provides the following major functions:

1. Calculate content size
2. Count response codes
3. Analyze requesting IP addresses
4. Manage endpoints


Technical Plan for the Application

Technically, the project follows this structure:

1. Flume takes the streaming log from the running application server and stores it in HDFS. Flume uses compression to store the huge log files, both to speed up data transfer and for storage efficiency.

2. Apache Spark uses HDFS as the input source and analyzes the data using MLlib. Apache Spark stores the analyzed data in MongoDB.

3. A RESTful Java service fetches JSON objects from MongoDB and sends them to the front end, where graphical tools are used to present the data.


Scala Spark Code for Log Analysis

Note: This application is written in the Scala language. Below is the operative part of the code. Visit the GitHub link later in this chapter for the complete Scala code for this application.

import org.apache.spark.rdd.RDD   // RDD type used by the methods below

// Calculates the size of log content, and prints the min, max and average size.
// Caching is done because the size RDD is used repeatedly.
def calcContentSize(log: RDD[AccessLogs]) = {
  val size = log.map(log => log.contentSize).cache()
  val average = size.reduce(_ + _) / size.count()
  println("Content Size :: Average :: " + average +
    " || Maximum :: " + size.max() + " || Minimum :: " + size.min())
}

// Prints every response code with its frequency of occurrence.
def responseCodeCount(log: RDD[AccessLogs]) = {
  val responseCount = log.map(log => (log.responseCode, 1))
    .reduceByKey(_ + _)
    .take(1000)
  println(s"""Response Codes Count: ${responseCount.mkString("[", ",", "]")}""")
}

// Prints the IP addresses that appear more than once in the server log, with their counts.
def ipAddressFilter(log: RDD[AccessLogs]) = {
  val result = log.map(log => (log.ipAddr, 1))
    .reduceByKey(_ + _)
    .filter(count => count._2 > 1)
    // .map(_._1).take(10)   // alternative: keep only the first 10 addresses
    .collect()
  println(s"""IP Addresses Count :: ${result.mkString("[", ",", "]")}""")
}


Sample Log Data

Input fields (selected fields):

Certain fields have been omitted to keep the code clear. The response code field is the basis of the major reports.

1. ipAddress: String
2. dateTime: String
3. method: String
4. endPoint: String
5. protocol: String
6. responseCode: Long
7. contentSize: Long

Sample input rows of data:

64.242.88.10 [07/Mar/2014:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

64.242.88.10 [07/Mar/2014:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523

64.242.88.10 [07/Mar/2014:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291

64.242.88.10 [07/Mar/2014:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352

64.242.88.10 [07/Mar/2014:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253

64.242.88.10 [07/Mar/2014:16:23:12 -0800] "GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore&param1=1.12&param2=1.12 HTTP/1.1" 200 11382


Sample Output of Web Log Analysis

Content Size :: Average :: 10101 || Maximum :: 138789 || Minimum :: 0

Response Codes Count: [(401,113), (200,591), (302,1)]

IP Addresses Count :: [(127.0.0.1,31), (207.195.59.160,15), (67.131.107.5,3), (203.147.138.233,13), (64.242.88.10,452), (10.0.0.153,188)]

EndPoints :: [(/wap/Project/login.php,15), (/cgi-bin/mailgraph.cgi/mailgraph_2.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_0.png,12), (/wap/Project/loginsubmit.php,12), (/cgi-bin/mailgraph.cgi/mailgraph_2_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_1.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_0_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_1_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_3_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_3.png,12)]

Intermediate data is stored in the Hadoop File System in CSV format.

To see detailed code, visit: https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzer.scala

This web log analyzer can be enhanced in many ways. For example, it can analyze the history of logs from previous years and discover web access trends. The application can also be made to move data older than five years into permanent backup storage.


Conclusion and Findings

There are more than 100 technologies in and around the Apache ecosystem. The most basic is the MapReduce technique used by the Hadoop engine, and many stacks are available on top of MapReduce. It is important to incorporate the right set of elements to build the right stack for a particular large-scale data analytics problem. A few powerful technologies like HDFS, Spark, Hive, MongoDB, and Flume/Kafka are likely to make a big data application powerful and worthwhile.

It is also useful to experiment with many other technologies during the development of such a log analyzer. Flume and Kafka are among the most powerful tools to handle streaming data. Spark has its own streaming API, but it is not easy to integrate with HDFS storage. Developing this application also helps one learn Linux-based tasks and shell scripts, along with data handling tools like AWK and the stream editor (sed).

This application reduces the burden of manually handling logs on database, application, or history servers. Moreover, it helps present the analyzed data in an impressive way that leads to easy decision making. This application came into development after doing much research on big data tools such as Apache Spark, which saved a lot of time and cost later. It was developed using agile development practices.


Review Questions

Q1: Describe the advantages of a web log analyzer.

Q2: Describe the major challenges in developing this application.

Q3: Check out the references in this chapter. Identify 3-4 major lessons learned from the code and video.


Chapter 11: Data Mining Primer

Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.

Data mining is a multidisciplinary field that borrows techniques from a variety of fields. It utilizes knowledge of data quality and data organization from the databases area. It draws modeling and analytical techniques from statistics and computer science (artificial intelligence). It also draws knowledge of decision-making from the field of business management.

The field of data mining emerged in the context of pattern recognition in defense, such as identifying friend-or-foe on a battlefield. Like many other defense-inspired technologies, it has evolved to help gain competitive advantage in business.

For example, "customers who buy cheese and milk also buy bread 90 percent of the time" would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, "people with blood pressure greater than 160 and an age greater than 65 are at high risk of dying from a heart stroke" is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity.

Past data can be of predictive value in many complex situations, especially where the pattern may not be easily visible without a modeling technique. Here is a dramatic case of a data-driven decision-making system that beat the best human experts. Using past data, a decision tree model was developed to predict the votes of Justice Sandra Day O'Connor, who held the swing vote on a 5-4 divided US Supreme Court. All her previous decisions were coded on a few variables. What emerged from data mining was a simple four-step decision tree that was able to accurately predict her votes 71 percent of the time. In contrast, the legal analysts could at best predict correctly 59 percent of the time (Source: Martin et al. 2004).

Page 201: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Gathering and Selecting Data

To learn from data, quality data needs to be effectively gathered, cleaned and organized, and then efficiently mined. One requires the skills and technologies for consolidation and integration of data elements from many sources.

Gathering and curating data takes time and effort, particularly when it is unstructured or semi-structured. Unstructured data can come in many forms, such as documents, blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets, and streams of machine-generated data from connected machines, RFID tags, the Internet of Things, and so on. Eventually the data should be rectangularized, that is, put into rectangular data shapes with clear columns and rows, before submitting it to data mining.

Knowledge of the business domain helps select the right streams of data for pursuing new insights. Only the data that suits the nature of the problem being solved should be gathered. The data elements should be relevant, and suitably address the problem being solved. They could directly impact the problem, or they could be a suitable proxy for the effect being measured. Selected data could also be gathered from the data warehouse. Every industry and business function has its own requirements and constraints: the healthcare industry will provide a different type of data with different data names than, say, the HR function, and there will be different issues of quality and privacy for these data.

Page 202: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Data Cleansing and Preparation

The quality of data is critical to the success and value of a data mining project. Otherwise the situation becomes one of garbage in, garbage out (GIGO). The quality of incoming data varies by the source and nature of the data. Data from internal operations is likely to be of higher quality, as it will be more accurate and consistent. Data from social media and other public sources is less under the control of the business, and is less likely to be reliable.

Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed: filling in missing values, reining in the effects of outliers, transforming fields, binning continuous variables, and much more, before it is ready for analysis. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up 60-80% of the time needed for a data mining project.

Page 203: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Outputs of Data Mining

Data mining techniques can serve different types of objectives, and the outputs of data mining will reflect the objective being served. There are many ways of representing the outputs of data mining.

One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps one visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch. A related format is a set of business rules, which are if-then statements that show causality. A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate modes of representing the output.

The output can be in the form of a regression equation, or a mathematical function that represents the best-fitting curve for the data. This equation may include linear and nonlinear terms. Regression equations are a good way of representing the output of classification exercises, and are also a good representation of forecasting formulae.

A population "centroid" is a statistical measure for describing the central tendencies of a collection of data points. These might be defined in a multidimensional space. For example, a centroid could be "middle-aged, highly educated, high-net-worth professionals, married with two children, living in coastal areas", or a population of "20-something, ivy-league-educated tech entrepreneurs based in Silicon Valley", or a collection of "vehicles more than 20 years old, giving low mileage per gallon, which failed environmental inspection". These are typical representations of the output of a cluster analysis exercise.

Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example: those who buy milk and bread will also buy butter (with 80 percent probability).

Page 204: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Evaluating Data Mining Results

There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model is created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. Each of these techniques can be implemented with many algorithms. A common metric for all classification techniques is predictive accuracy:

Predictive Accuracy = (Correct Predictions) / (Total Predictions)

Suppose a data mining project has been initiated to develop a predictive model for cancer patients using a decision tree. Using a relevant set of variables and data instances, a decision tree model is created, and the model is then used to predict other data instances. When a true-positive data point is classified as positive, that is a correct prediction, called a true positive (TP). Similarly, when a true-negative data point is classified as negative, that is a true negative (TN). On the other hand, when a true-positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a true-negative data point is classified as positive, that is called a false positive (FP). This is represented using the confusion matrix (Figure 11.1).

Confusion Matrix                         True Class
                                Positive                Negative
Predicted Class    Positive     True Positive (TP)      False Positive (FP)
                   Negative     False Negative (FN)     True Negative (TN)

Figure 11.1: Confusion Matrix

Thus the predictive accuracy can be specified by the following formula:

Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)
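For example, assuming a hypothetical test set of 100 patients for which the model produces TP = 40, TN = 45, FP = 10, and FN = 5, the predictive accuracy would be (40 + 45) / (40 + 45 + 10 + 5) = 85/100 = 85%.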

Page 205: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

All classification techniques have a predictive accuracy associated with a predictive model. The highest possible value is 100%. In practice, predictive models with more than 70% accuracy can be considered usable in business domains, depending upon the nature of the business.

There are no equally objective measures to judge the accuracy of unsupervised learning techniques such as cluster analysis. There is no single right answer for the results of these techniques. For example, the value of a segmentation model depends upon the value the decision-maker sees in those results.

Page 206: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Data Mining Techniques

Data may be mined to help make more efficient decisions in the future, or it may be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 11.2).

Data Mining Techniques

Supervised Learning (predictive ability based on past data):
- Classification using machine learning: decision trees, neural networks
- Classification using statistics: regression

Unsupervised Learning (exploratory analysis to discover patterns):
- Cluster analysis
- Association rules

Figure 11.2: Important Data Mining Techniques

The most important class of problems solved using data mining is classification problems. Classification techniques are called supervised learning, as there is a way to supervise whether the model is providing right or wrong answers. These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision-making process in the future. The data of past decisions is organized and mined for decision rules or equations, which are then codified to produce more accurate decisions.

Decision trees are the most popular data mining technique, for many reasons:

1. Decision trees are easy to understand and easy to use, by analysts as well as executives, and they show high predictive accuracy.

2. Decision trees automatically select the most relevant variables out of all the available variables for decision making.

3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.

4. Even non-linear relationships can be handled well by decision trees.

There are many algorithms to implement decision trees. Some of the popular ones are C5, CART, and CHAID.

Regression is the most popular statistical data mining technique. The goal of regression is to derive a smooth, well-defined curve that best fits the data. Regression analysis techniques, for example, can be used to model and predict energy consumption as a function of daily temperature. Simply plotting the data may show a non-linear curve. Applying a non-linear regression equation will fit the data well, with high accuracy. Once such a regression model has been developed, the energy consumption on any future day can be predicted using the equation. The accuracy of a regression model depends entirely upon the dataset used, and not on the algorithm or tools used.
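As an illustration only (the quadratic functional form and the coefficient names are assumptions, not taken from any dataset in this book), such a non-linear regression model might take the form: EnergyConsumption = b0 + b1 x Temperature + b2 x Temperature^2, where the coefficients b0, b1, and b2 are estimated from the historical data.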

Artificial Neural Networks (ANN) are a sophisticated data mining technique from the artificial intelligence stream of computer science. They mimic the behavior of the human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron, with the result communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. A neural network can be trained by making a decision over and over again with many data points. It continues to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. The intermediate values passed within the layers of neurons may not make any intuitive sense to an observer. Thus, neural networks are considered a black-box system.

Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for the automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. There can be any number of clusters produced from the data. The K-means technique is a popular clustering technique, and allows the user to guide the analysis by selecting the desired number (K) of clusters in the data. Clustering is also known as the segmentation technique, and it helps divide and conquer large data sets. The technique shows the clusters of things from past data. The output is the centroids for each cluster, and the allocation of data points to their cluster. The centroid definition is then used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques. A small sketch of running K-means is shown after this paragraph.
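As a minimal sketch only (the choice of Spark MLlib from the earlier case study's technology stack, and the made-up two-dimensional customer data, are assumptions, not part of this chapter), K-means clustering can be run in Scala as follows:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical customer points: (age, annual spend in $000s)
val points = sc.parallelize(Seq(
  Vectors.dense(25.0, 20.0), Vectors.dense(27.0, 22.0),
  Vectors.dense(45.0, 80.0), Vectors.dense(47.0, 85.0)))

// Ask K-means for K = 2 clusters, with up to 20 iterations
val model = KMeans.train(points, 2, 20)

// The cluster centers are the centroids describing each discovered segment
model.clusterCenters.foreach(println)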

Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps answer questions about cross-selling opportunities. This is the heart of the personalization engines used by e-commerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers; there are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beer.
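To make the confidence idea concrete with purely hypothetical numbers: a rule {diapers} → {beer} with 2% support and 65% confidence would mean that 2% of all baskets contain both items, and that 65% of the baskets that contain diapers also contain beer.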


Mining Big Data

As data grows larger and larger, there are a few ways in which analyzing Big Data is different.

From Causation to Correlation

There is more data available than there are theories and tools to explain it. Historically, theories of human behavior, and theories of the universe in general, have been intuited and tested using limited and sampled data, with some statistical confidence level. Now that data is available in extremely large quantities about many people and many factors, there may be too much noise in the data to articulate and test clean theories. In that case, it may suffice to treat co-occurrences or correlations of events as significant, without necessarily establishing strong causation.

From Sampling to the Whole

Pooling all the data together into a single big data system can help discover events that bring about a fuller picture of the situation, and highlight threats or opportunities that an organization faces. Working from the full data set can enable discovering remote but extremely valuable insights. For example, an analysis of the purchasing habits of millions of customers and their billions of transactions at thousands of stores can give an organization a vast, detailed, and dynamic view of sales patterns in the company, which may not be available from the analysis of small samples of data by each store or region.

From Dataset to Data Stream

A flowing stream has a perishable and unlimited connotation to it, while a data set has a finitude and permanence about it. With any given infrastructure, one can only consume so much data at a time. Data streams are many, large, and fast. Thus one has to choose which of the many streams of data to engage with; it is equivalent to deciding which stream to fish in. The metrics used for the analysis of streams tend to be relatively simple and relate to the time dimension. Most of the metrics are statistical measures such as counts and means. For example, a company might want to monitor customer sentiment about its products. It could create a social media listening platform that reads all tweets and blog posts about the company in real time. This platform would (a) keep a count of positive and negative sentiment messages every minute, and (b) flag any messages that merit attention, such as sending an online advertisement or purchase offer to that customer.


Data Mining Best Practices

Effective and successful use of data mining requires both business and technology skills. The business aspects help one understand the domain and the key questions; they also help one imagine possible relationships in the data, and create hypotheses to test. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform.

An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The data mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 11.3):

1. Business understanding: The first and most important step in data mining is asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise. In other words, selecting a data mining project is like any other project, in that it should show strong payoffs if the project is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy. A related important element is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of the proposed model as well as the data sets available and required.


Figure 11.3: CRISP-DM Data Mining cycle

2. Data understanding: A related important step is to understand the data available for mining. One needs to be imaginative in scouring for many elements of data from many sources that could help address the hypotheses. Without relevant data, the hypotheses cannot be tested.

3. Data preparation: The data should be relevant, clean, and of high quality. It is important to assemble a team that has a mix of technical and business skills, who understand the domain and the data. Data cleaning can take 60-70% of the time in a data mining project. It may be desirable to continue to experiment with, and add, new data elements from external sources that could help improve predictive accuracy.

4. Modeling: This is the actual task of running many algorithms using the available data to discover whether the hypotheses are supported. Patience is required to engage continuously with the data until the data yields some good insights. A host of modeling tools and algorithms should be used. A tool can be tried with different options, such as running different decision tree algorithms.

5. Model evaluation: One should not accept what the data says at first. It is better to triangulate the analysis by applying multiple data mining techniques, and conducting many what-if scenarios, to build confidence in the solution. One should evaluate and improve the model's predictive accuracy with more test data. When the accuracy has reached a satisfactory level, the model should be deployed.

6. Dissemination and rollout: It is important that the data mining solution is presented to the key stakeholders, and is deployed in the organization. Otherwise the project will be a waste of time, and a setback for establishing and supporting a data-driven decision-making culture in the organization. The model should eventually be embedded in the organization's business processes.


Conclusion

Data mining is like diving into the rough material to discover a valuable finished nugget. While the technique is important, domain knowledge is also important for proposing imaginative solutions that can then be tested with data mining. The business objective should be well understood, and should always be kept in mind, to ensure that the results are beneficial to the sponsor of the exercise.


Review Questions

1. What is data mining? What are supervised and unsupervised learning techniques?
2. Describe the key steps in the data mining process. Why is it important to follow these processes?
3. What is a confusion matrix?
4. Why is data preparation so important and time consuming?
5. What are some of the most popular data mining techniques?
6. How is mining Big Data different from traditional data mining?


Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)


Creating Cluster Servers on AWS and Installing Hadoop from Cloudera

The objective of this tutorial is to set up a big data processing infrastructure using cloud computing, and the Hadoop and Spark software.


Step 1: Creating Amazon EC2 Servers

1. Open https://aws.amazon.com/
2. Click on Services.
3. Click on EC2.

You can see the result below once you click on EC2. If you already have a server, you can see the number of running servers, their volumes, and other information.

4. Click on the Launch Instance button.


5. Click on AWS Marketplace.
6. Type Ubuntu in the search text box.
7. Click on the Select button.

8. Ubuntu is free, so you don't have to worry about the service price. Click on the Continue button.


9. Choose General purpose m1.large and click on Next: Configure Instance Details. (Do not choose the micro instance t1.micro; it is free, but it will not be able to handle the installation.)

10. Click on Next: Add Storage.

11. Specify a volume size of 20 GB (the default of 8 GB will not be sufficient) and click on Next: Tag Instance.

12. Type the name cs488-master (this label helps tell which server is the master and which are the slaves) and click on Next: Security Group.


13. We need to open our server to the world, including most of the ports, because Cloudera needs to use additional ports. Specify the security group name. Type: choose Custom TCP Rule, Port Range: 0-65500, Source: Anywhere. Then click on Review Instance.

14. A warning message appears because we are opening our server to the world; ignore it for now. Click on the Launch button.

15. Type a key pair name and click on the Download Key Pair button (remember the location of the downloaded file; we need this file to log in to the server), and then click on Launch Instances.


16. Now the master server is created.

Now we need four more servers to form the cluster. We do not need to repeat the whole process four times; we simply increase the number of instances and get the 4 servers.

Now we are going to launch 4 more servers, which will be the slaves.

Please repeat steps 4-9: go to the AWS Marketplace, choose Ubuntu, and select the instance type (General purpose).

17. Type 4 in Number of Instances, which will create the 4 additional servers for us.

18. Name the servers cs488-slave.

19. Select the previously created security group.


20. It is important that you choose the existing key pair for these servers too.

If everything goes well, you will have 5 instances, 5 volumes, 1 key pair, and 1 or 2 security groups.

We have now successfully created 5 servers.


Step 2: Connecting to the Servers and Installing the Cloudera Distribution of Hadoop

First of all, take a note of all your server details: IP addresses and DNS addresses, for the master and the slaves. For example:

Master Public DNS Address: ec2-54-200-210-141.us-west-2.compute.amazonaws.com
Master Private IP Address: 172.31.20.82

Slave 1 Private IP: 172.31.26.245
Slave 2 Private IP: 172.31.26.242
Slave 3 Private IP: 172.31.26.243
Slave 4 Private IP: 172.31.26.244

Once you have these recorded, you can connect to the server. If you are using Linux as your operating system, you can use the ssh command from the terminal to connect.

Connecting to the server (Windows)

1. Download the SSH software PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). Also download PuTTYgen to convert the authentication file from .pem to .ppk.


2. Open PuTTYgen and load the authentication file.

Click on Save Private Key.

3. Open PuTTY and type the master public DNS address in the Host Name field, then click on SSH in the left panel > click on Auth >> select the recently converted authentication file (.ppk), and finally click on the Open button.


4. Now you will be able to connect to the server. Please type "ubuntu", the default username, to log in to the system.

5. Once you connect, type the following commands into the terminal:
6. sudo aptitude update
7. cd /usr/local/src/
8. sudo wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
9. sudo chmod u+x cloudera-manager-installer.bin
10. sudo ./cloudera-manager-installer.bin


11. There are 4 more steps where you click on Next and Yes for the license agreement. Once you finish the installation, you need to restart the service:

12. sudo service cloudera-scm-server restart

You are now able to connect to Cloudera Manager from your browser. The address will be http://<YOUR PUBLIC DNS SERVER>:7180, e.g., http://ec2-54-200-210-141.us-west-2.compute.amazonaws.com:7180. The default username and password to log in to the system is admin/admin.


Once the server restarts, it will open the login screen again. The same username and password (admin/admin) is used to log in to the system.

13. Click on Launch the Classic Wizard.

14. Click on Continue.


15. Provide all the private IP addresses of the master and slave computers and click on the Search button.

16. Click on the Continue button.

17. Choose None for SOLR and None for IMPALA, and click on the Continue button.


18. Click on Another User >> type "ubuntu" and select "All hosts accept same private key" >> upload the authentication file (.pem) and click on the Continue button.

19. Now Cloudera will install the software on each of our servers.

20. Once the installation is complete, click on the Continue button.

21. Once it reaches 100%, click on the Continue button. Do not disconnect the internet or shut down the machine; if the process does not complete, we need to restart the whole process. Click on the Continue button.


22. Click on Continue.

23. Choose Core Hadoop and click on the Inspect Role Assignments button.


24. Now, for your master IP, only the NameNode role should be selected, with DataNode unchecked. This is what designates the master and slave servers.

25. Now Cloudera will install all the services. For your future use, you can record the username and password of each service. Click on Test Connection.


26. Click on Continue.


27. Now all the installation is complete. You now have 1 master (name) node and 4 data nodes.


28. You should see the dashboard.


Step 3: WordCount using MapReduce

29. Now log in to the master server from PuTTY.
30. Run the following commands:
31. cd ~/
32. mkdir code-and-data
33. cd code-and-data
34. sudo wget https://s3.amazonaws.com/learn-hadoop/hadoop-infiniteskills-richmorrow-class.tgz
35. sudo tar -xvzf hadoop-infiniteskills-richmorrow-class.tgz
36. cd data
37. sudo -u hdfs hadoop fs -mkdir /user/ubuntu
38. sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu
39. hadoop fs -put shakespeare shakespeare-hdfs
40. hadoop version
41. hadoop fs -ls shakespeare-hdfs

42. sudo hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar wordcount shakespeare-hdfs wordcount-output

43. hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar sleep -m 10 -r 10 -mt 20000 -rt 20000


Appendix 2: Spark Installation and Tutorial

This tutorial will help install Spark and get it running on a standalone machine. It will then help develop a simple analytical application using the R language.


Step 1: Verifying Java Installation

Java is one of the mandatory requirements for installing Spark. Try the following command to verify the Java version.

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)

Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, install Java before proceeding to the next step.


Step 2: Verifying Scala Installation

Verify the Scala installation using the following command.

$ scala -version

If Scala is already installed on your system, you get to see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

In case you don't have Scala installed on your system, proceed to the next step for the Scala installation.


Step 3: Downloading Scala

Download the latest version of Scala from the Scala website (scala-lang.org). For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.


Step 4: Installing Scala

Follow the steps given below for installing Scala.

Extract the Scala tar file

Type the following command for extracting the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move the Scala software files

Use the following commands to move the Scala software files to their respective directory (/usr/local/scala).

$ su -

Password:

# cd /home/Hadoop/Downloads/

# mv scala-2.11.6 /usr/local/scala

# exit

Set PATH for Scala

Use the following command for setting the PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation

After installation, it is better to verify it. Use the following command for verifying the Scala installation.

$ scala -version

If Scala is installed correctly, you will see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL


Step 5: Downloading Spark

Download the latest version of Spark. For this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the download folder.


Step 6: Installing Spark

Follow the steps given below for installing Spark.

Extract the Spark tar file

Use the following command for extracting the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Move the Spark software files

Use the following commands to move the Spark software files to their respective directory (/usr/local/spark).

$ su -

Password:

# cd /home/Hadoop/Downloads/

# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark

# exit

Setting up the environment for Spark

Add the following line to the ~/.bashrc file. This adds the location of the Spark software files to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command for sourcing the ~/.bashrc file.

$ source ~/.bashrc


Step 7: Verifying the Spark Installation

Write the following command for opening the Spark shell.

$ spark-shell

If Spark is installed successfully, then you will see output like the following.

Spark assembly has been built with Hive, including Datanucleus jars on classpath

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop

15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop

15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;

ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)

15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server

15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.

Welcome to Spark version 1.4.0

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)

Type in expressions to have them evaluated.

Spark context available as sc

scala>
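
Once the scala> prompt appears, a quick smoke test confirms that the Spark context (sc) is usable. This check is not part of the original installation steps; any small computation will do. Summing the integers 1 to 100 should print something like the result shown:

scala> sc.parallelize(1 to 100).sum()

res0: Double = 5050.0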

Here you can see the video:

How to install Spark

You might encounter a "file specified not found" error when you are first installing standalone Spark on Windows:


To fix this, you have to set up your JAVA_HOME environment variable.

Step 1: Start -> Run -> command prompt (cmd)

Step 2: Determine where your JDK is located; by default it is in C:\Program Files.

Step 3: Select the JDK to use. In my case, I will use JDK 8.

Copy the JDK directory to your clipboard, go to your command prompt, and set JAVA_HOME to that directory (for example, set JAVA_HOME=C:\path\to\your\jdk). Then press Enter.


Step 4: Add the JDK's bin folder to the general PATH (for example, set PATH=%JAVA_HOME%\bin;%PATH%), and press Enter.

Now go to your Spark folder and run bin\spark-shell.

You have installed Spark. Let's try to use it.


Step 8: Application: WordCount in Scala

Now we will do an example of word count in Scala:

val textFile = sc.textFile("hdfs://…")

val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://…")

NOTE: If you are working on standalone Spark:

The counts.saveAsTextFile("hdfs://…") command will give you a NullPointerException error.

Solution: counts.coalesce(1).saveAsTextFile()
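
Before saving the output, it can also be handy to inspect the most frequent words directly in the Spark shell. Here is a minimal sketch, assuming the counts RDD from the word count example above is in scope:

// Twenty most frequent words, highest count first
counts.sortBy(_._2, ascending = false).take(20).foreach { case (word, n) => println(word + "\t" + n) }

// Number of distinct words
counts.count()

sortBy orders the (word, count) pairs by their count, take(20) pulls the top twenty back to the driver, and count() reports how many distinct words were found.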

For implementing a word cloud, we could use R in our Spark console:

However, if you click on SparkR straight away, you will get an error.

To fix this:

Step 1: Set up the environment variables.

In the PATH variable add your path. I added: ;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\sbin;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin

Step 2: Install the R software and RStudio. Then add the path of the R software to the PATH variable.

I added this to my existing path: ;C:\Program Files\R\R-3.2.2\bin\x64\ (Remember, each path that you add must be separated by a semicolon, with no spaces.)

Step 3: Run the command prompt as an administrator.

Step 4: Now execute the command "SparkR" from the command prompt. If successful, you should see the message "Spark context is available…" as seen below. If your path is not set correctly, you can alternatively navigate to the location where you have downloaded SparkR, in my case (C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin), and execute the "SparkR" command.

Step 5: Configuration inside RStudio to connect to Spark.

Execute the below three commands in RStudio every time:

# Here we are setting up the SPARK_HOME environment variable

Sys.setenv(SPARK_HOME = "C:/spark-1.5.1-bin-hadoop2.6/spark-1.5.1-bin-hadoop2.6")

# Set the library path

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# Loading the SparkR library

library(SparkR)

If you see the below message, then you are all set to start working with SparkR.

Now let's start coding in R:


library(tm)

lords <- Corpus(DirSource("temp/"))

To see what's in that corpus, type the command

inspect(lords)

This should print out the contents on the main screen. Next, we need to clean it up. Execute the following in the command line, one line at a time:

lords <- tm_map(lords, stripWhitespace)

lords <- tm_map(lords, tolower)

lords <- tm_map(lords, removeWords, stopwords("english"))

lords <- tm_map(lords, stemDocument)

The tm_map function comes with the tm package. The various commands are self-explanatory: strip unnecessary whitespace, convert everything to lowercase (otherwise the word cloud might highlight capitalised words separately), remove English common words like 'the' (so-called 'stopwords'), and carry out text stemming for the final tidy-up. Depending on what you want to achieve, you could also explicitly remove numbers and punctuation with the removeNumbers and removePunctuation arguments.

It is possible that you may get error messages whilst executing some of the commands, e.g. missing packages. If so, install the missing packages with install.packages() (for this example the tm, wordcloud, and RColorBrewer packages are needed), load them with library(), and repeat.

If all is well, then you should now be ready to create your first word cloud! Load the wordcloud and RColorBrewer packages with library(wordcloud) and library(RColorBrewer), and try this:

wordcloud(lords, scale = c(5, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))


Additional Resources

Here are some other books, papers, videos, and other resources, for a deeper dive into the topics covered in this book.

1. Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.

2. McKinsey Global Institute Report (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey.com

3. Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail but Some Don't. Penguin Press.

4. Zaharia, Matei, et al. (2010). "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." University of California, Berkeley.

5. Ryza, Sandy; Laserson, Uri; et al. (2014). Advanced Analytics with Spark. O'Reilly.

Websites:

6. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
7. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
8. Hadoop API site: http://hadoop.apache.org/docs/current/api/
9. Apache Spark: http://spark.apache.org/docs/latest/
10. https://www.biostat.wisc.edu/~kbroman/Rintro/Rwinpack.html
11. http://robjhyndman.com/hyndsight/building-r-packages-for-windows/
12. https://stevemosher.wordpress.com/ten-steps-to-building-an-r-package-under-windows/
13. http://www.inside-r.org/packages/cran/wordcloud/docs/wordcloud
14. https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
15. https://intellipaat.com/tutorial/spark-tutorial/
16. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50
17. https://en.wikipedia.org/wiki/NoSQL


18. http://www.planetcassandra.org/what-is-apache-cassandra/
19. http://www.datastax.com/nosql
20. https://www.sitepen.com/blog/2010/05/11/nosql-architecture/
21. http://nosql-database.org/
22. http://webpages.uncc.edu/xwu/5160/nosqldbs.pdf

Video Resources

23. Doug Cutting on 'Hadoop at 10': https://www.youtube.com/watch?v=yDZRDDu3CJo
24. Status of Apache community: https://www.youtube.com/watch?v=sOZnf8Nn3Fo
25. Spark 2.0 updates (showing a nice demo across R, Scala and SQL, using tweets and clustering): https://www.youtube.com/watch?v=9xSz0ppBtFg
26. https://www.youtube.com/watch?v=VwiGHUKAHWM
27. https://www.youtube.com/watch?v=L5QWO8QBG5c
28. https://www.youtube.com/watch?v=KvQto_b3sqw
29. https://www.youtube.com/watch?v=YW28qItH_tA


About the Author

Dr. Anil Maheshwari is a Professor of Computer Science and Information Systems, and the Director of the Center for Data Analytics, at Maharishi University of Management. He teaches courses in data analytics, and helps organizations extract deep insights from their data. He worked in a variety of leadership roles at IBM in Austin, TX, and has also worked at many other companies including startups.

He has taught at the University of Cincinnati, City University of New York, University of Illinois, and others. He earned an Electrical Engineering degree from the Indian Institute of Technology in Delhi, an MBA from the Indian Institute of Management in Ahmedabad, and a Ph.D. from Case Western Reserve University. He is a practitioner of the Transcendental Meditation technique.

He is the author of the #1 bestseller Data Analytics Made Accessible.

He blogs interesting stuff on IT and Enlightenment at anilmah.com.

Instructors can reach him for course materials at akm2030@gmail.com. Speaking engagements are welcome.