

Big Data Essentials

Copyright © 2016 by Anil K. Maheshwari, Ph.D.

By purchasing this book, you agree not to copy or distribute the book by any means, mechanical or electronic.

No part of this book may be copied or transmitted without written permission.

Other books by the same author:

Data Analytics Made Accessible, the #1 Bestseller in Data Mining

Moksha: Liberation Through Transcendence


Preface

Big Data is a new, and inclusive, natural phenomenon. It is as messy as nature itself. It requires a new kind of Consciousness to fathom its scale and scope, and its many opportunities and challenges. Understanding the essentials of Big Data requires suspending many conventional expectations and assumptions about data… such as completeness, clarity, consistency, and conciseness. Fathoming and taming the multi-layered Big Data is a dream that is slowly becoming a reality. It is a rapidly evolving field that is growing exponentially in value and capabilities.

There is a growing number of books being written on Big Data. They fall mostly into two categories. The first kind focuses on business aspects, and discusses the strategic internal shifts required for reaping the business benefits from the many opportunities offered by Big Data. The second kind focuses on particular technology platforms, such as Hadoop or Spark. This book aims to bring together the business context and the technologies in a seamless way.

This book was written to meet the needs of an introductory Big Data course. It is meant for students, as well as executives, who wish to take advantage of emerging opportunities in Big Data. It provides an intuition of the wholeness of the field in simple language, free from jargon and code. All the essential Big Data technology tools and platforms, such as Hadoop, MapReduce, Spark, and NoSQL, are discussed. Most of the relevant programming details have been moved to the Appendices to ensure readability. The short chapters make it easy to quickly understand the key concepts. A complete case study of developing a Big Data application is included.

Thanks to Maharishi Mahesh Yogi for creating a wonderful university whose consciousness-based environment made writing this evolutionary book possible. Thanks to many current and former students for contributing to this book. Dheeraj Pandey assisted with the Web log analyzer application and its details. Suraj Thapalia assisted with the Hadoop installation guide. Enkhbileg Tseeleesuren helped write the Spark tutorial. Thanks to my family for supporting me in this process. My daughters Ankita and Nupur reviewed the book and made helpful comments. My father Mr. R. L. Maheshwari and brother Dr. Sunil Maheshwari also read the book and enthusiastically approved it. My colleague Dr. Edi Shivaji also reviewed the book.

May the Big Data Force be with you!

Dr. Anil Maheshwari


August 2016, Fairfield, IA


Contents

Preface

Chapter 1 – Wholeness of Big Data
Introduction
Understanding Big Data
CASELET: IBM Watson: A Big Data system
Capturing Big Data
Volume of Data
Velocity of Data
Variety of Data
Veracity of Data
Benefitting from Big Data
Management of Big Data
Organizing Big Data
Analyzing Big Data
Technology Challenges for Big Data
Storing Huge Volumes
Ingesting streams at an extremely fast pace
Handling a variety of forms and functions of data
Processing data at huge speeds
Conclusion and Summary
Organization of the rest of the book
Review Questions
Liberty Stores Case Exercise: Step B1

Section 1

Chapter 2 - Big Data Applications
Introduction
CASELET: Big Data Gets the Flu
Big Data Sources
People to People Communications


Social Media
People to Machine Communications
Web access
Machine to Machine (M2M) Communications
RFID tags
Sensors
Big Data Applications
Monitoring and Tracking Applications
Analysis and Insight Applications
New Product Development
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B2

Chapter 3 - Big Data Architecture
Introduction
CASELET: Google Query Architecture
Standard Big Data architecture
Big Data Architecture examples
IBM Watson
Netflix
eBay
VMware
The Weather Company
Ticketmaster
LinkedIn
PayPal
CERN
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B3

Section 2


Chapter 4: Distributed Computing using Hadoop
Introduction
Hadoop Framework
HDFS Design Goals
Master-Slave Architecture
Block system
Ensuring Data Integrity
Installing HDFS
Reading and Writing Local Files into HDFS
Reading and Writing Data Streams into HDFS
Sequence Files
YARN
Conclusion
Review Questions

Chapter 5 – Parallel Processing with MapReduce
Introduction
MapReduce Overview
MapReduce programming
MapReduce Data Types and Formats
Writing MapReduce Programs
Testing MapReduce Programs
MapReduce Jobs Execution
How MapReduce Works
Managing Failures
Shuffle and Sort
Progress and Status Updates
Hadoop Streaming
Conclusion
Review Questions

Chapter 6 – NoSQL databases
Introduction


RDBMS vs NoSQL
Types of NoSQL Databases
Architecture of NoSQL
CAP theorem
Popular NoSQL Databases
HBase
Architecture Overview
Reading and Writing Data
Cassandra
Architecture Overview
Reading and Writing Data
Hive Language
Hive Language Capabilities
Pig Language
Conclusion
Review Questions

Chapter 7 – Stream Processing with Spark
Introduction
Spark Architecture
Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)
Spark Ecosystem
Spark for big data processing
MLlib
Spark GraphX
SparkR
Spark SQL
Spark Streaming
Spark applications
Spark vs Hadoop
Conclusion


Review Questions

Chapter 8 – Ingesting Data
Wholeness
Messaging Systems
Point to Point Messaging System
Publish-Subscribe Messaging System
Apache Kafka
Use Cases
Kafka Architecture
Producers
Consumers
Broker
Topic
Summary of Key Attributes
Distribution
Guarantees
Client Libraries
Apache ZooKeeper
Kafka Producer example in Java
Conclusion
Review Questions
References

Chapter 9 – Cloud Computing Primer
Introduction
Cloud Computing Characteristics
In-house storage
Cloud storage
Cloud Computing: Evolution of Virtualized Architecture
Cloud Service Models
Cloud Computing Myths
Cloud Computing: Getting Started


Conclusion
Review Questions

Section 3

Chapter 10 – Web Log Analyzer application case study
Introduction
Client-Server Architecture
Web Log analyzer
Requirements
Solution Architecture
Benefits of this solution
Technology stack
Apache Spark
Spark Deployment
Components of Spark
HDFS
MongoDB
Apache Flume
Overall Application logic
Technical Plan for the Application
Scala Spark code for log analysis
Sample Log data
Sample Input Data
Sample Output of Web Log Analysis
Conclusion and Findings
Review Questions

Chapter 11: Data Mining Primer
Gathering and selecting data
Data cleansing and preparation
Outputs of Data Mining
Evaluating Data Mining Results
Data Mining Techniques


Mining Big Data
From Causation to Correlation
From Sampling to the Whole
From Dataset to Data stream
Data Mining Best Practices
Conclusion
Review Questions

Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
Creating Cluster server on AWS, Install Hadoop from Cloudera
Step 1: Creating Amazon EC2 Servers
Step 2: Connecting server and installing required Cloudera distribution of Hadoop
Step 3: Word Count using MapReduce

Appendix 2: Spark Installation and Tutorial
Step 1: Verifying Java Installation
Step 2: Verifying Scala installation
Step 3: Downloading Scala
Step 4: Installing Scala
Step 5: Downloading Spark
Step 6: Installing Spark
Step 7: Verifying the Spark Installation
Step 8: Application: Word Count in Scala

Additional Resources

About the Author


Chapter 1 – Wholeness of Big Data

Introduction

Big Data is an all-inclusive term that refers to extremely large, very fast, diverse, and complex data that cannot be managed with traditional data management tools. Ideally, Big Data would harness all kinds of data and deliver the right information, to the right person, in the right quantity, at the right time, to help make the right decision. Big Data can be managed by developing infinitely scalable, totally flexible, and evolutionary data architectures, coupled with the use of extremely cost-effective computing components. The infinite potential knowledge embedded within this cosmic computer would help connect everything to the Unified Field of all the laws of nature.

This book will provide a complete overview of Big Data for the executive and the data specialist. This chapter will cover the key challenges and benefits of Big Data, and the essential tools and technologies now available for organizing and manipulating Big Data.


Understanding Big Data

Big Data can be examined on two levels. On a fundamental level, it is data that can be analyzed and utilized for the benefit of the business. On another level, it is a special kind of data that poses unique challenges. This is the level that this book will focus on.

Figure 1.1: Big Data Context

At the level of business, data generated by business operations can be analyzed to generate insights that can help the business make better decisions. This makes the business grow bigger and generate even more data, and the cycle continues. This is represented by the blue cycle on the top-right of Figure 1.1. This aspect is discussed in Chapter 11, a primer on Data Analytics.

On another level, Big Data is different from traditional data in every way: space, time, and function. The quantity of Big Data is 1,000 times more than that of traditional data. The speed of data generation and transmission is 1,000 times faster. The forms and functions of Big Data are much more diverse: from numbers to text, pictures, audio, videos, activity logs, machine data, and more. There are also many more sources of data, from individuals to organizations to governments, using a range of devices from mobile phones to computers to industrial machines. Not all data will be of equal quality and value. This is represented by the red cycle on the bottom left of Figure 1.1. This aspect of Big Data, and its new technologies, is the main focus of this book.

Big Data is mostly unstructured data. Every type of data is structured differently, and will have to be dealt with differently. There are huge opportunities for technology providers to innovate and manage the entire life cycle of Big Data… to generate, gather, store, organize, analyze, and visualize this data.


CASELET: IBM Watson: A Big Data system

IBM created the Watson system as a way of pushing the boundaries of Artificial Intelligence and natural language understanding technologies. Watson beat the world champion human players of Jeopardy (a quiz-style TV show) in February 2011. Watson reads up on data about everything on the web, including the entire Wikipedia. It digests and absorbs the data based on simple generic rules such as: books have authors; stories have heroes; and drugs treat ailments. A Jeopardy clue, received in the form of a cryptic phrase, is broken down into many possible potential sub-clues of the correct answer. Each sub-clue is examined to see the likelihood of its answer being the correct answer for the main problem. Watson calculates the confidence level of each possible answer. If the confidence level reaches more than a threshold level, it decides to offer the answer to the clue. It manages to do all this in a mere 3 seconds.

Watson is now being applied to diagnosing diseases, especially cancer. Watson can read all the new research published in medical journals to update its knowledge base. It is being used to estimate the probability of various diseases, by applying factors such as the patient's current symptoms, health history, genetic history, medication records, and other factors to recommend a particular diagnosis. (Source: Smartest machines on Earth: youtube.com/watch?v=TCOhyaw5bwg)

Figure 1.2: IBM Watson playing Jeopardy

Q1: What kinds of Big Data knowledge, technologies, and skills are required to build a system like Watson? What kind of resources are needed?

Q2: Will doctors be able to compete with Watson in diagnosing diseases and prescribing medications? Who else could benefit from a system like Watson?


Capturing Big Data

If data were simply growing too large, OR only moving too fast, OR only becoming too diverse, it would be relatively easy. However, when the four Vs (Volume, Velocity, Variety, and Veracity) arrive together in an interactive manner, it creates a perfect storm. While the Volume and Velocity of data drive the major technological concerns and the costs of managing Big Data, these two Vs are themselves being driven by the third V, the Variety of forms and functions and sources of data.

Volume of Data

The quantity of data has been relentlessly doubling every 12-18 months. Traditional data is measured in Gigabytes (GB) and Terabytes (TB), but Big Data is measured in Petabytes (PB) and Exabytes (1 Exabyte = 1 million TB).

This data is so huge that it is almost a miracle that one can find any specific thing in it in a reasonable period of time. Searching the world-wide web was the first true Big Data application. Google perfected the art of this application, and developed many of the path-breaking technologies we see today to manage Big Data.

The primary reason for the growth of data is the dramatic reduction in the cost of storing data. The costs of storing data have decreased by 30-40% every year. Therefore, there is an incentive to record everything that can be observed. This is called the 'datafication' of the world. The costs of computation and communication have also been coming down similarly. Another reason for the growth of data is the increase in the number of forms and functions of data. More about this in the Variety section.

Velocity of Data

If traditional data is like a lake, Big Data is like a fast-flowing river. Big Data is being generated by billions of devices, and communicated at the speed of the internet. Ingesting all this data is like drinking from a fire hose. One does not have control over how fast the data will come. A huge, unpredictable data stream is the new metaphor for thinking about Big Data.

The primary reason for the increased velocity of data is the increase in internet speed. Internet speeds available to homes and offices are now increasing from 10 MB/sec to 1 GB/sec (100 times faster). More people are getting access to high-speed internet around the world. Another important reason is the increased variety of sources that can generate and communicate data from anywhere, at any time. More on that in the Variety section.

Variety of Data


Big Data is inclusive of all forms of data, for all kinds of functions, from all sources and devices. If traditional data, such as invoices and ledgers, were like a small store, Big Data is the biggest imaginable shopping mall that offers unlimited variety. There are three major kinds of variety.

1. The first aspect of variety is the form of data. Data types range in order of simplicity and size from numbers to text, graph, map, audio, video, and others. There could be composite data that includes many elements in a single file. For example, text documents have text, graphs, and pictures embedded in them. Videos can have charts and songs embedded in them. Audio and video have different and more complex storage formats than numbers and text. Numbers and text can be more easily analyzed than an audio or video file. How should composite entities be stored and analyzed?

2. The second aspect is the variety of function of data. There are human chats and conversation data, songs and movies for entertainment, business transaction records, machine operations performance data, new product design data, old data for backup, and so on. Human communication data would be processed very differently from operational performance data, with totally different objectives. A variety of applications are needed to compare pictures in order to recognize people's faces, compare voices to identify the speaker, and compare handwriting to identify the writer.

3. The third aspect of variety is the source of data. Mobile phones and tablet devices enable a wide range of applications, or apps, to access and generate data anytime, anywhere. Web access logs are another new and huge source of diagnostic data. ERP systems generate massive amounts of structured business transactional information. Sensors on machines, and RFID tags on assets, generate incessant and repetitive data. Broadly speaking, there are three types of sources of data: human-human communications, human-machine communications, and machine-to-machine communications. The sources of data, and the applications arising from that data, will be discussed in the next chapter.


Figure 1.3: Sources of Big Data (Source: Hortonworks.com)

Veracity of Data

Veracity relates to the believability and quality of data. Big Data is messy. There is a lot of misinformation and disinformation. The reasons for poor quality of data can range from human and technical error to malicious intent.

1. The source of information may not be authoritative. For example, all websites are not equally trustworthy. Any information from whitehouse.gov or from nytimes.com is more likely to be authentic and complete. Wikipedia is useful, but not all pages are equally reliable. The communicator may have an agenda or a point of view.

2. The data may not be received correctly because of human or technical failure. Sensors and machines for gathering and communicating data may malfunction and may record and transmit incorrect data. Urgency may require the transmission of the best data available at a point in time. Such data makes reconciliation with later, accurate records more problematic.

3. The data provided and received may, however, also be intentionally wrong, for competitive or security reasons.

Data needs to be sifted and organized by quality factors for it to be put to any great use.


Benefitting from Big Data

Data usually belongs to the organization that generates it. There is other data, such as social media data, that is freely accessible under an open general license. Organizations can use this data to learn about their consumers, improve their service delivery, and design new products to delight their customers and gain a competitive advantage. Data is also like a new natural resource. It is being used to design new digital products, such as on-demand entertainment and learning.

Organizations may choose to gather and store this data for later analysis, or to sell it to other organizations who might benefit from it. They may also legitimately choose to discard parts of their data for privacy or legal reasons. However, organizations cannot afford to ignore Big Data. Organizations that do not learn to engage with Big Data could find themselves left far behind their competition, landing in the dustbin of history. Innovative small and new organizations can use Big Data to quickly scale up and beat larger and more mature organizations.

Big Data applications exist in all industries and aspects of life. There are three major types of Big Data applications: Monitoring and Tracking, Analysis and Insight, and new digital product development.

Monitoring and Tracking Applications: Consumer goods producers use monitoring and tracking applications to understand the sentiments and needs of their customers. Industrial organizations use Big Data to track inventory in massive interlinked global supply chains. Factory owners use it to monitor machine performance and do preventive maintenance. Utility companies use it to predict energy consumption, and manage demand and supply. Information Technology companies use it to track website performance and improve its usefulness. Financial organizations use it to project trends better and make more effective and profitable bets.

Analysis and Insight: Political organizations use Big Data to micro-target voters and win elections. Police forces use Big Data to predict and prevent crime. Hospitals use it to better diagnose diseases and to guide medicine prescriptions. Ad agencies use it to design more targeted marketing campaigns quickly. Fashion designers use it to track trends and create more innovative products.


Figure 1.4: The first Big Data President

New Product Development: Incoming data could be used to design new products such as reality TV entertainment. Stock market feeds could be a digital product. This area needs much more development.


Management of Big Data

Many organizations have started initiatives around the use of Big Data. However, most organizations do not necessarily have a grip on it. Here are some emerging insights into making better use of Big Data.

1. Across all industries, the business case for Big Data is strongly focused on addressing customer-centric objectives. The first focus of deploying Big Data initiatives is to protect and enhance customer relationships and the customer experience.

2. Solve a real pain point. Big Data should be deployed for specific business objectives, so that management is not overwhelmed by the sheer size of it all.

3. Organizations are beginning their pilot implementations by using existing and newly accessible internal sources of data. It is better to begin with data under one's control, and where one has a superior understanding of the data.

4. Put humans and data together to get the most insight. Combining data-based analysis with human intuition and perspectives is better than going just one way.

5. Advanced analytical capabilities are required, but lacking, for organizations to get the most value from Big Data. There is a growing awareness of the need to build or hire those skills and capabilities.

6. Use more diverse data, not just more data. This would provide a broader perspective into reality and better quality insights.

7. The faster you analyze the data, the greater its predictive value. The value of data depreciates with time. If the data is not processed in five minutes, then the immediate advantage is lost.

8. Don't throw away data if no immediate use can be seen for it. Data has value beyond what you initially anticipate. Data can add perspective to other data later on, in a multiplicative manner.

9. Maintain one copy of your data, not multiple copies. This would help avoid confusion and increase efficiency.

10. Plan for exponential growth. Data is expected to continue to grow at exponential rates. Storage costs continue to fall, data generation continues to grow, and data-based applications continue to grow in capability and functionality.

11. A scalable and extensible information management foundation is a prerequisite for Big Data advancement. Big Data builds upon a resilient, secure, efficient, flexible, and real-time information processing environment.


12. Big Data is transforming business, just like IT did. Big Data is a new phase representing a digital world. Business and society are not immune to its strong impacts.


Organizing Big Data

Good organization of the data depends upon the purpose for which it will be used.

Given the huge quantities involved, it would be desirable to organize the data to speed up the search for any specific, desired item in the entire dataset. The cost of storing and processing the data, too, would be a major driver for the choice of an organizing pattern.

Given the fast speed of data, it would be desirable to create a scalable number of ingest points. It would also be desirable to create at least a thin veneer of control over the data by maintaining counts and averages over time, unique values received, and so on.

Given the variety in form factors, different data needs to be stored and analyzed differently. Videos need to be stored separately and served in a streaming mode. Text data may be combined, cleaned, and visualized for themes and sentiments.

Given the different quality levels of data, various data sources may need to be ranked and prioritized before serving them to the audience. For example, the quality of a web page may be computed through a PageRank mechanism.
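
For a flavor of how such a ranking can be computed, the classic PageRank formula scores a page by the ranks of the pages that link to it. With damping factor d (commonly 0.85), N total pages, M(p_i) the set of pages that link to page p_i, and L(p_j) the number of outbound links on page p_j, the rank of p_i is

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Iterating this formula until the scores stabilize yields a quality ranking over all pages; production search engines apply many further refinements on top of this basic idea.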


Analyzing Big Data

Big Data can be analyzed in two ways, called analyzing Big Data in motion and analyzing Big Data at rest. The first way is to process the incoming stream of data in real time for quick and effective statistics about the data. The other way is to store and structure the data and apply standard analytical techniques on batches of data to generate insights. These insights could then be visualized using real-time dashboards. Big Data can thus be utilized to visualize a flowing or a static situation. The nature of processing this huge, diverse, and largely unstructured data is limited only by one's imagination.

Figure 1.5: Big Data Architecture

A million points of data can be plotted on a graph to offer a view of the density of the data. However, plotting a million points on a graph may produce a blurred image which hides, rather than highlights, the distinctions. In such a case, binning the data would help, or selecting the top few frequent categories may deliver greater insight. Streaming data can also be visualized by simple counts and averages over time. For example, below is a dynamically updated chart that shows up-to-date statistics of visitor traffic to my blog site, anilmah.com. The bars show the number of page views, and the inner darker bars show the number of unique visitors. The dashboard could also show the view by days, weeks, or years.


Figure 1.6: Real-time dashboard of website performance for the author's blog

Text data could be combined, filtered, cleaned, thematically analyzed, and visualized in a word cloud. Here is a word cloud from a recent stream of tweets (i.e., Twitter messages) from US Presidential candidates Hillary Clinton and Donald Trump. A larger word implies a greater frequency of occurrence in the tweets. This can help in understanding the major topics of discussion between the two.

Figure 1.7: A word cloud of Hillary Clinton's and Donald Trump's tweets
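
The counting behind such a word cloud is a simple word-frequency tally. Below is a minimal Scala sketch of that step; the tweet strings and the stop-word list are illustrative placeholders, not the actual campaign data.

// Tally word frequencies over a collection of tweet texts (illustrative sketch)
object WordFrequency {
  def main(args: Array[String]): Unit = {
    val tweets = Seq(
      "Jobs and the economy are the top priority",
      "We will create jobs and grow the economy")           // placeholder tweets
    val stopWords = Set("the", "and", "are", "we", "will")  // illustrative stop-word list

    val counts = tweets
      .flatMap(_.toLowerCase.split("\\W+"))                 // split into words
      .filter(w => w.nonEmpty && !stopWords.contains(w))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

    // The most frequent words would be drawn largest in the word cloud
    counts.toSeq.sortBy(-_._2).foreach { case (w, n) => println(s"$w: $n") }
  }
}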


Technology Challenges for Big Data

There are four major technological challenges in managing Big Data, and matching layers of technology to address them.

Storing Huge Volumes

The first challenge relates to storing huge quantities of data. No machine can be big enough to store the relentlessly growing quantity of data. Therefore, data needs to be stored on a large number of smaller, inexpensive machines. However, with a large number of machines, there is the inevitable challenge of machine failure. Each of these commodity machines will fail at some point or another. The failure of a machine could entail a loss of the data stored on it.

The first layer of Big Data technology helps store huge volumes of data while avoiding the risk of data loss. It distributes data across a large cluster of inexpensive commodity machines, and ensures that every piece of data is stored on multiple machines, to guarantee that at least one copy is always available. Hadoop is the most well-known clustering technology for Big Data. Its data storage pattern is called the Hadoop Distributed File System (HDFS). This system is built on the pattern of Google's file system, which was designed to store billions of pages and sort them to answer user search queries.
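
To make this concrete, here is a minimal Scala sketch that writes a small file into HDFS and reads it back through the Hadoop FileSystem API; the NameNode address and the file path are placeholder assumptions, and the full installation and programming details appear in Chapter 4 and Appendix 1.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Write a small file into HDFS and read it back (sketch only)
object HdfsRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020")   // hypothetical NameNode address
    val fs = FileSystem.get(conf)

    val path = new Path("/tmp/hello.txt")               // hypothetical file path
    val out = fs.create(path, true)                     // HDFS replicates the file's blocks
    out.writeBytes("Hello, HDFS\n")
    out.close()

    val in = fs.open(path)
    println(scala.io.Source.fromInputStream(in).mkString)
    in.close()
  }
}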

Ingesting streams at an extremely fast pace

The second challenge relates to the Velocity of data, i.e., handling torrential streams of data. Some of these streams may be too large to store, but they must still be ingested and monitored. The solution lies in creating special ingesting systems that can open an unlimited number of channels for receiving data. These queuing systems can hold the data, from which consumer applications can request and process data at their own pace.

Big Data technology manages this velocity problem using a special stream-processing engine, where all incoming data is fed into a central queuing system. From there, a fork-shaped system sends data in both a batch-processing and a stream-processing direction. The stream-processing engine can do its work while the batch processing does its own work. Apache Spark is the most popular system for streaming applications.
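
As a small illustration of the streaming side of this fork, the Scala sketch below counts words arriving on a socket in ten-second batches using Spark Streaming; the socket source (localhost:9999), the batch interval, and the local master setting are illustrative assumptions, not the book's case-study configuration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Count words per 10-second batch from a text stream (sketch only)
object StreamCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical data source
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()   // running statistics on the stream, while batch jobs run elsewhere
    ssc.start()
    ssc.awaitTermination()
  }
}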

Handling a variety of forms and functions of data

The third challenge relates to the structuring and access of all the varieties of data that comprise Big Data. Storing them in traditional flat or relational file structures would be too wasteful and slow. The third layer of Big Data technology solves this problem by storing the data in non-relational systems that relax many of the stringent conditions of the relational model. These are called NoSQL (Not Only SQL) databases.

HBase and Cassandra are two of the better-known NoSQL database systems. HBase, for example, stores each data element separately along with its key identifying information; this is called a key-value pair format. Cassandra stores data in a column-family (wide-column) format. There are many other variants of NoSQL databases. Languages such as Pig and Hive are used to access this data.
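
To make the key-value pair idea concrete, here is a minimal Scala sketch that writes and reads one cell through the HBase client API; the table name, column family, and row key are hypothetical examples.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Store and fetch one key-value pair in HBase (sketch only)
object HBaseKeyValue {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("web_pages"))      // hypothetical table

    val put = new Put(Bytes.toBytes("page#1001"))                  // row key
    put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("title"), Bytes.toBytes("Big Data"))
    table.put(put)                                                 // write the cell

    val result = table.get(new Get(Bytes.toBytes("page#1001")))    // read it back by key
    println(Bytes.toString(result.getValue(Bytes.toBytes("content"), Bytes.toBytes("title"))))

    table.close()
    conn.close()
  }
}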

Processing data at huge speeds

The fourth challenge relates to moving large amounts of data from storage to the processor, as this would consume enormous network capacity and choke the network. The alternative, innovative mode would be to move the processor to the data.

The second layer of Big Data technology avoids this choking of the network. It distributes the task logic throughout the cluster of machines where the data is stored. Those machines work in parallel on the data assigned to them. A follow-up process consolidates the outputs of all the small tasks and delivers the final results. MapReduce, also invented by Google, is the best-known technology for parallel processing of distributed Big Data.
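
The map-shuffle-reduce pattern can be pictured in miniature with ordinary Scala collections. The sketch below walks through the classic word-count example conceptually; a real job would be written against the Hadoop MapReduce API and run across a cluster (see Chapter 5 and Appendix 1), and the two input documents here are placeholders.

// Conceptual walk-through of MapReduce word count using plain Scala collections
object MapReduceIdea {
  def main(args: Array[String]): Unit = {
    val documents = Seq("big data is big", "data moves fast")   // each entry stands for one input split

    // Map phase: each mapper emits (word, 1) pairs for its split, in parallel on many machines
    val mapped: Seq[(String, Int)] =
      documents.flatMap(doc => doc.split("\\s+").map(word => (word, 1)))

    // Shuffle and sort: group the pairs by key so each reducer sees all values for one word
    val grouped: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2) }

    // Reduce phase: sum the counts for each word
    val counts: Map[String, Int] = grouped.map { case (word, ones) => word -> ones.sum }

    counts.toSeq.sortBy(_._1).foreach(println)
  }
}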

Table 1.1: Technological challenges and solutions for Big Data

Challenge | Description | Solution | Technology
Volume | Avoid the risk of data loss from machine failure in clusters of commodity machines | Replicate segments of data on multiple machines; a master node keeps track of segment locations | HDFS
Volume & Velocity | Avoid choking of network bandwidth by moving large volumes of data | Move processing logic to where the data is stored; manage using parallel processing algorithms | MapReduce
Variety | Efficient storage of large and small data objects | Columnar databases using a key-value pair format | HBase, Cassandra
Velocity | Monitoring streams too large to store | Fork-shaped architecture to process data as stream and as batch | Spark


Once these major technological challenges are met, all traditional analytical and presentation tools can be applied to Big Data. There are many additional supportive technologies that make the task of managing Big Data easier. For example, a resource manager (such as YARN) can help monitor the resource usage and load balancing of the machines in the cluster.


Conclusion and Summary

Big Data is a major phenomenon that impacts everyone, and it is an opportunity to create new ways of working. Big Data is extremely large, complex, fast, and not always clean; it is data that comes from many sources, such as people, the web, and machine communications. It needs to be gathered, organized, and processed in a cost-effective way that manages the volume, velocity, variety, and veracity of Big Data. Hadoop and Spark systems are popular technological platforms for this purpose. Here is a list of the many differences between traditional data and Big Data.

Table 1.2: Comparing Big Data with Traditional Data

Feature | Traditional Data | Big Data
Representative Structure | Lake / Pool | Flowing Stream / River
Primary Purpose | Manage business activities | Communicate, Monitor
Source of data | Business transactions, documents | Social media, Web access logs, machine-generated
Volume of data | Gigabytes, Terabytes | Petabytes, Exabytes
Velocity of data | Ingest level is controlled | Real-time unpredictable ingest
Variety of data | Alphanumeric | Audio, Video, Graphs, Text
Veracity of data | Clean, more trustworthy | Varies depending on source
Structure of data | Well-structured | Semi-structured or unstructured
Physical Storage of Data | In a Storage Area Network | Distributed clusters of commodity computers
Database organization | Relational databases | NoSQL databases
Data Access | SQL | NoSQL languages such as Pig
Data Manipulation | Conventional data processing | Parallel processing
Data Visualization | Variety of tools | Dynamic dashboards with simple measures
Database Tools | Commercial systems | Open source: Hadoop, Spark
Total Cost of System | Medium to High | High


Organization of the rest of the book

This book will cover applications, architectures, and the essential Big Data technologies. The rest of the book is organized as follows.

Section 1 will discuss sources, applications, and architectural topics. Chapter 2 will discuss a few compelling business applications of Big Data, based on an understanding of the different sources and formats of data. Chapter 3 will cover some examples of architectures used by many Big Data applications.

Section 2 will discuss the six major technology elements identified in the Big Data Ecosystem (Figure 1.5). Chapter 4 will discuss Hadoop and how its Distributed File System (HDFS) works. Chapter 5 will discuss MapReduce and how this parallel processing algorithm works. Chapter 6 will discuss NoSQL databases, to learn how to structure the data into databases for fast access; the Pig and Hive languages, for data access, will be included. Chapter 7 will cover streaming data and the systems for ingesting and processing this data; this chapter will cover Spark, an integrated, in-memory processing toolset to manage Big Data. Chapter 8 will cover data ingest systems, with Apache Kafka. Chapter 9 will be a primer on Cloud Computing technologies used for renting storage and computing at third-party locations.

Section 3 will include primers and tutorials. Chapter 10 will present a case study on the web log analyzer, an application that ingests a log of a large number of web request entries every day and can create summary and exception reports. Chapter 11 will be a primer on data analytics technologies for analyzing data; a full treatment can be found in my book, Data Analytics Made Accessible. Appendix 1 will be a tutorial on installing a Hadoop cluster on the Amazon EC2 cloud. Appendix 2 will be a tutorial on installing and using Spark.


Review Questions

Q1. What is Big Data? Why should anyone care?

Q2. Describe the 4V model of Big Data.

Q3. What are the major technological challenges in managing Big Data?

Q4: What are the technologies available to manage Big Data?

Q5. What kinds of analyses can be done on Big Data?

Q6: Watch the Cloudera CEO present the evolution of Hadoop at https://www.youtube.com/watch?v=S9xnYBVqLws. Why did people not pay attention to Hadoop and MapReduce when they were first introduced? What implications does this have for emerging technologies?


Liberty Stores Case Exercise: Step B1

Liberty Stores Inc. is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of Health and Sustainability) citizens worldwide. The company is 20 years old and is growing rapidly. It now operates in 5 continents, 50 countries, and 150 cities, and has 500 stores. It sells 20,000 products and has 10,000 employees. The company has revenues of over $5 billion and a profit of about 5% of its revenue. The company pays special attention to the conditions under which its products are grown and produced. It donates about one-fifth (20%) of its pre-tax profits to global and local charitable causes.

Q1: Create a comprehensive Big Data strategy for the CEO of the company.

Q2: How can Big Data systems, such as IBM Watson, help this company?


Section 1

This section covers three important high-level topics.

Chapter 2 will cover big data sources, and many applications in many industries.

Chapter 3 will cover architectures for managing big data.


Chapter 2 - Big Data Applications

Introduction

If a traditional software application is a lovely cat, then a Big Data application is a powerful tiger. An ideal Big Data application will take advantage of all the richness of data and produce relevant information to make the organization responsive and successful. Big Data applications can align the organization with the totality of natural laws, the source of all success.

Companies like the consumer goods giant Procter & Gamble have inserted Big Data into all aspects of their planning and operations. The industrial giant Volkswagen asks all its business units to identify some realistic initiative using Big Data to grow their unit's sales. The entertainment giant Netflix processes 400 billion user actions every day; these companies are among the biggest users of Big Data.

Figure 2.1: A Big Data application is a powerful tiger (Source: Flickr.com)


CASELET: Big Data Gets the Flu

Google Flu Trends was an enormously successful influenza forecasting service, pioneered by Google. It employed Big Data, such as the stream of search terms used in its ubiquitous Internet search service. The program aimed to better predict flu outbreaks using data and information from the U.S. Centers for Disease Control and Prevention (CDC). What was most amazing was that this application was able to predict the onset of flu almost two weeks before the CDC saw it coming. From 2004 till about 2012, it was able to successfully predict the timing and geographical location of the arrival of the flu season around the world.

Figure 2.2: Google Flu Trends

However, it failed spectacularly to predict the 2013 flu outbreak. Data used to predict Ebola's spread in 2014-15 yielded wildly inaccurate results and created a major panic. Newspapers across the globe spread this application's worst-case scenarios for the Ebola outbreak of 2014.

Google Flu Trends failed for two reasons: Big Data hubris, and algorithmic dynamics. (a) The quantity of data does not mean that one can ignore foundational issues of measurement, construct validity, reliability, and dependencies among the data; and (b) the Google Flu Trends predictions were based on a commercial search algorithm that frequently changes based on Google's business goals. This uncertainty skewed the data in ways even Google engineers did not understand, even skewing the accuracy of predictions. Perhaps the biggest lesson is that there is far less information in the data typically available in the early stages of an outbreak than is needed to parameterize the test models.

Q1: What lessons would you learn from the death of a prominent and highly successful Big Data application?

Q2: What other Big Data applications could be inspired by the success of this application?


Big Data Sources

Big Data is inclusive of all data about all activities everywhere. It can, thus, potentially transform our perspective on life and the universe. It brings new insights in real time and can make life happier and the world more productive. Big Data can, however, also bring perils, in terms of violations of privacy, and social and economic disruption.

There are three major categories of data sources: human communications, human-machine communications, and machine-machine communications.

Page 45: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

People to People Communications

People and corporations increasingly communicate over electronic networks. Distance and time have been annihilated. Everyone communicates through phone and email. News travels instantly. Influential networks have expanded. The content of communication has become richer and multimedia. High-resolution cameras in mobile phones enable people to take pictures and videos, and instantly share them with friends and family. All these communications are stored in the facilities of many intermediaries, such as telecom and internet service providers. Social media is a new, but particularly transformative, type of human-human communication.

Social Media

Social media platforms such as Facebook, Twitter, LinkedIn, YouTube, Flickr, Tumblr, Skype, Snapchat, and others have become an increasingly intimate part of modern life. These are among the hundreds of social media platforms that people use, and they generate huge streams of text, pictures, videos, logs, and other multimedia data.

People share messages and pictures through social media such as Facebook and YouTube. They share photo albums through Flickr. They communicate in short asynchronous messages with each other on Twitter. They make friends on Facebook, and follow others on Twitter. They do video conferencing using Skype, and leaders deliver messages that sometimes go viral through social media. All these data streams are part of Big Data, and can be monitored and analyzed to understand many phenomena, such as patterns of communication, as well as the gist of the conversations. These media have been used for a wide variety of purposes, with stunning effects.

Figure 2.3: A sampling of major social media


People to Machine Communications

Sensors and the web are two of the kinds of machines that people communicate with. Personal assistants such as Siri and Cortana are the latest in man-machine communications, as they try to understand human requests in natural language and fulfil them. Wearable devices such as Fitbit and smart watches are smart devices that read, store, and analyze people's personal data, such as blood pressure and weight, food and exercise data, and sleep patterns. The world-wide web is like a knowledge machine that people interact with to get answers to their queries.

Web access

The world-wide web has integrated itself into all parts of human and machine activity. The usage of tens of billions of pages by billions of web users generates huge amounts of enormously valuable clickstream data. Every time a web page is requested, a log entry is generated at the provider's end. The web page provider tracks the identity of the requesting device and user, and the time and spatial location of each request. On the requester's side, there are small pieces of computer code and data called cookies, which track the web pages received, the date/time of access, and some identifying information about the user. All the web access logs and cookie records can provide web usage records that can be analyzed to discover opportunities for marketing purposes.

A web log analyzer is an application that monitors streaming web access logs in real time to check on website health and to flag errors. A detailed case study of the practical development of this application is presented in Chapter 10.
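
To give a feel for the raw material such an analyzer works with, the Scala sketch below parses a single access-log line; the Common Log Format and the sample line are illustrative assumptions, since actual log formats vary by web server.

// Parse one web-server access-log line (assumes Common Log Format; sample line is made up)
object LogLineParser {
  private val LogPattern =
    """^(\S+) \S+ (\S+) \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)""".r

  def main(args: Array[String]): Unit = {
    val line =
      """192.168.1.20 - alice [05/Aug/2016:10:15:32 -0500] "GET /index.html HTTP/1.1" 200 5123"""
    line match {
      case LogPattern(ip, user, time, method, url, status, bytes) =>
        println(s"ip=$ip user=$user time=$time method=$method url=$url status=$status bytes=$bytes")
      case _ =>
        println("line did not match the expected format")
    }
  }
}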


Machine to Machine (M2M) Communications

M2M communication is also sometimes called the Internet of Things (IoT). A trillion devices are connected to the internet, and they communicate with each other or with some master machines. All this data can be accessed and harnessed by the makers and owners of those machines.

Machines and equipment have many kinds of sensors to measure certain environmental parameters, which can be broadcast to communicate their status. RFID tags and sensors embedded in machines help generate the data. Containers on ships are tagged with RFID tags that convey their location to all those who can listen. Similarly, when pallets of goods are moved in warehouses or large retail stores, those pallets carry electromagnetic (RFID) tags that convey their location. Cars carry an RFID transponder to identify themselves to automated toll booths and pay the tolls. Robots in a factory, and internet-connected refrigerators in a house, continually broadcast a 'heartbeat' to show that they are functioning normally. Surveillance videos, using commodity cameras, are another major source of machine-generated data.

Automobiles contain sensors that record and communicate operational data. A modern car can generate many megabytes of data every day, and there are more than 1 billion motor vehicles on the road. Thus the automotive industry itself generates huge amounts of data. Self-driving cars would only add to the quantity of data generated.

RFID tags

An RFID tag is a radio transmitter with a little antenna that can respond to and communicate essential information to special readers through a Radio Frequency (RF) channel. A few years ago, major retailers such as Walmart decided to invest in RFID technology to take the retail industry to a new level. They forced their suppliers to invest in RFID tags on the supplied products. Today, almost all retailers and manufacturers have implemented RFID-tag-based solutions.

Figure 2.4: A small passive RFID tag

Here is how an RFID tag works. When a passive RFID tag comes in the vicinity of an RF reader and is 'tickled', the tag responds by broadcasting a fixed identifying code. An active RFID tag has its own battery and storage, and can store and communicate a lot more information. Every reading of a message from an RFID tag by an RF reader creates a log entry. Thus there is a steady stream of data from every reader as it records information about all the RFID tags in its area of influence. The readings may be logged regularly, and thus there will be many more records than are necessary to track the location and movement of an item. All the duplicate and redundant records are removed to produce clean, consolidated data about the location and status of items.
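
As a small illustration of that consolidation step, the Scala sketch below keeps only the latest reading per tag; the tag IDs, reader IDs, and timestamps are made-up sample values.

// Consolidate repeated RFID reader log entries: keep the latest reading per tag (sketch only)
object RfidConsolidate {
  case class Reading(tagId: String, readerId: String, timestamp: Long)

  def main(args: Array[String]): Unit = {
    val readings = Seq(                       // made-up reader log entries
      Reading("TAG-42", "DOCK-1", 1000L),
      Reading("TAG-42", "DOCK-1", 1005L),     // redundant: same tag, same reader, moments later
      Reading("TAG-42", "DOCK-2", 1100L),     // the tagged item has moved
      Reading("TAG-77", "DOCK-1", 1002L))

    val latestPerTag = readings
      .groupBy(_.tagId)
      .map { case (tag, rs) => tag -> rs.maxBy(_.timestamp) }

    latestPerTag.values.foreach(println)      // one consolidated record per tag
  }
}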

Sensors

A sensor is a small device that can observe and record physical or chemical parameters. Sensors are everywhere. A photosensor in an elevator or train door can sense if someone is moving, and thus keep the door from closing. A CCTV camera can record a video for surveillance purposes. A GPS device can record its geographical location every moment.

Figure 2.5: An embedded sensor

Temperature sensors in a car can measure the temperature of the engine, the tires, and more. The thermostat in a building or a refrigerator also has temperature sensors. A pressure sensor can measure the pressure inside an industrial boiler.


Big Data Applications

Monitoring and Tracking Applications

Public Health Monitoring

The US government is encouraging all healthcare stakeholders to establish a national platform for interoperability and data sharing standards. This would enable secondary use of health data, which would advance Big Data analytics and personalized holistic precision medicine. This would be a broad-based platform, like the Google Flu Trends case.

Consumer Sentiment Monitoring

Social media has become more powerful than advertising. Many consumer goods companies have moved a bulk of their marketing budgets from traditional advertising media into social media. They have set up Big Data listening platforms, where social media data streams (including tweets, Facebook posts, and blog posts) are filtered and analyzed for certain keywords or sentiments, by certain demographics and regions. Actionable information from this analysis is delivered to marketing professionals for appropriate action, especially when the product is new to the market.

Figure 2.6: Architecture for a Listening Platform (Source: Intelligenthq.com)

Asset tracking

The US Department of Defense is encouraging the industry to devise a tiny RFID chip that could prevent the counterfeiting of electronic parts that end up in avionics or circuit boards for other devices. Airplanes are one of the heaviest users of sensors, which track every aspect of the performance of every part of the plane. The data can be displayed on the dashboard, as well as stored for later detailed analysis. Working with communicating devices, these sensors can produce a torrent of data.

Theft by visitors, shoppers, and even employees is a major source of loss of revenue for retailers. All valuable items in the store can be assigned RFID tags, and the gates of the store are equipped with RF readers. This helps secure the products and reduce leakage (theft) from the store.

Supply chain monitoring

All containers on ships communicate their status and location using RFID tags. Thus, retailers and their suppliers can gain real-time visibility into the inventory throughout the global supply chain. Retailers can know exactly where items are in the warehouse, and so can bring them into the store at the right time. This is particularly relevant for seasonal items that need to be sold on time, or else they will be sold at a discount. With item-level RFID tags, retailers also gain full visibility of each item and can serve their customers better.

Electricity Consumption Tracking

Electric utilities can track the status of generating and transmission systems, and also measure and predict the consumption of electricity. Sophisticated sensors can help monitor voltage, current, frequency, temperature, and other vital operating characteristics of huge and expensive electric distribution infrastructure. Smart meters can measure the consumption of electricity at regular intervals of one hour or less. This data is analyzed to make real-time decisions to maximize power capacity utilization and total revenue generation.

Preventive Machine Maintenance

All machines, including cars and computers, will fail sometime, because one or more of their components will fail. Any precious equipment could be equipped with sensors. The continuous stream of data from these sensors could be monitored and analyzed to forecast the status of key components, and thus to monitor the overall machine's health. Preventive maintenance can be scheduled to reduce the cost of downtime.

Analysis and Insight Applications

Big Data can be structured and analyzed using data mining techniques to produce insights and patterns that can be used to make business better.

Predictive Policing


The Los Angeles Police Department (LAPD) invented the concept of Predictive Policing. The LAPD worked with UC Berkeley researchers to analyze its large database of 13 million crimes recorded over 80 years, and predicted the likelihood of crimes of certain types, at certain times, and in certain locations. They identified hotspots of crime where crimes had occurred, and where crime was likely to happen in the future. Crime patterns were mathematically modeled after a simple insight borrowed from the metaphor of earthquakes and their aftershocks. In essence, it said that once a crime occurred in a location, it represented a certain disturbance in harmony, and would thus lead to a greater likelihood of a similar crime occurring in the local vicinity in the near future. The model showed, for each police beat, the specific neighborhood blocks and specific time slots where crime was likely to occur.

Figure 2.7: An LAPD officer on predictive policing (Source: nbclosangeles.com)

By aligning the police cars' patrol schedules with the model's predictions, the LAPD was able to reduce crime by 12% to 26% for different categories of crime. Recently, the San Francisco Police Department released its own crime data for over 2 years, so that data analysts could model that data and help prevent future crimes.

Winning Political Elections

The US President, Barack Obama, was the first major political candidate to use Big Data in a significant way, in the 2008 elections. He is the first Big Data president. His campaign gathered data about millions of people, including supporters. They invented the "Donate Now" button for use in emails to obtain campaign contributions from millions of supporters. They created personal profiles of millions of supporters, covering what they had done and could do for the campaign. Data was used to determine undecided voters who could be converted to their side. They provided phone numbers of these undecided voters to the supporters to call, and then recorded the outcomes of those calls over the web, using interactive applications. Obama himself used his Twitter account to communicate his messages directly to his millions of followers.

After the elections, Obama converted the list of supporters into an advocacy machine that would provide grassroots support for the President's initiatives. Since then, almost all campaigns have used Big Data. Senator Bernie Sanders used the same Big Data playbook to build an effective national political machine powered entirely by small donors. The analyst Nate Silver created sophisticated predictive models, using inputs from many political polls and surveys, to successfully predict winners of US elections. Nate was, however, unsuccessful in predicting Donald Trump's rise, and that shows the limits of Big Data.

Personal Health

A correct diagnosis is the sine qua non of effective treatment. Medical knowledge and technology are growing by leaps and bounds. IBM Watson is a Big Data Analytics engine that ingests and metabolizes all the medical information in the world, and then applies it intelligently to an individual situation. Watson can provide a detailed and accurate medical diagnosis using current symptoms, patient history, medication history, environmental trends, and other parameters. Similar products might be offered as an app to licensed doctors, and even individuals, to improve productivity and accuracy in health care.

New Product Development

These applications are totally new concepts that did not exist earlier.

Flexible auto insurance

An auto insurance company can use the GPS data from cars to calculate the risk of accidents based on travel patterns. The automobile companies can use the car sensor data to track the performance of a car. Safer drivers can be rewarded and the errant drivers can be penalized.


Figure 2.8: GPS-based tracking of vehicles

Location-based retail promotion

A retailer, or a third-party advertiser, can target customers with specific promotions and coupons based on location data obtained through GPS, the time of day, the presence of stores nearby, and a mapping to the consumer preference data available from social media databases. Ads and offers can be delivered through mobile apps, SMS, and email. These are examples of mobile applications of Big Data.

Recommendation service

E-commerce has been a fast-growing industry over the last couple of decades. A variety of products are sold and shared over the internet. Web users' browsing and purchase histories on e-commerce sites are utilized to learn about their preferences and needs, and to advertise relevant product and pricing offers in real time. Amazon uses a personalized recommendation engine system to suggest new additional products to consumers based on affinities of various products. Netflix also uses a recommendation engine to suggest entertainment options to its users.
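
A simple way to picture such product affinities is to count how often two products appear together in purchase histories. The Scala sketch below does exactly that over a few made-up baskets; real recommendation engines are far more sophisticated, and the product names are placeholders.

// Count product co-occurrence across purchase baskets, then list the most frequent
// co-purchases for one product (illustrative sketch only)
object AffinityRecommender {
  def main(args: Array[String]): Unit = {
    val baskets = Seq(                        // made-up purchase histories
      Set("book", "lamp", "desk"),
      Set("book", "lamp"),
      Set("book", "pen"),
      Set("lamp", "desk"))

    // Emit each unordered pair of products per basket, then count the pairs
    val pairCounts = baskets
      .flatMap(b => b.toSeq.combinations(2).map(_.sorted).map(p => (p(0), p(1))))
      .groupBy(identity)
      .map { case (pair, occurrences) => pair -> occurrences.size }

    // Products most often bought together with "book"
    val forBook = pairCounts.collect {
      case ((a, b), n) if a == "book" => (b, n)
      case ((a, b), n) if b == "book" => (a, n)
    }.toSeq.sortBy(-_._2)

    forBook.foreach { case (product, n) => println(s"$product co-purchased $n times") }
  }
}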


Conclusion

Big Data has applicability across all industries. There are three major types of data sources of Big Data. They are people-to-people communications, people-to-machine communications, and machine-to-machine communications. Each type has many sources of data. There are three types of applications: the monitoring type, the analysis type, and new product development. This chapter presented a few business applications of each of those three types.


Review Questions

Q1: What are the major sources of Big Data? Describe a source of each type.

Q2: What are the three major types of Big Data applications? Describe two applications of each type.

Q3: Would it be ethical to arrest someone based on a Big Data model's prediction that the person is likely to commit a crime?

Q4: An auto insurance company learned about the movements of a person based on the GPS installed in the vehicle. Would it be ethical to use that as a surveillance tool?

Q5: Research and describe a Big Data application that has a proven return on investment for an organization.


Liberty Stores Case Exercise: Step B2

The Board of Directors asked the company to take concrete and effective steps to become a data-driven company. The company wants to understand its customers better. It wants to improve the happiness levels of its customers and employees. It wants to innovate on new products that its customers would like. It wants to relate its charitable activities to the interests of its customers.

Q1: What kind of data sources should the company capture for this?

Q2: What kind of Big Data applications would you suggest for this company?


Chapter 3 - Big Data Architecture

Introduction

Big Data Application Architecture is the configuration of tools and modules to accomplish the whole task. An ideal architecture would be resilient, secure, cost-effective, and adaptive to new needs and environments. This is achieved by beginning with proven architectures, and creatively and progressively restructuring them with new elements as additional needs and problems arise. Big Data architectures ultimately align with the architecture of the Universe, the source of all invincibility.


CASELET: Google Query Architecture

Google invented the first Big Data architecture. Their goal was to gather all the information on the web, organize it, and search it for specific queries from millions of users. An additional goal was to find a way to monetize this service by serving relevant and prioritized online advertisements on behalf of clients.

Google developed web-crawling agents which would follow all the links on the web and make a copy of all the content on all the web pages they visited.

Google invented cost-effective, resilient, and fast ways to store and process all that exponentially growing data. It developed a scale-out architecture in which it could linearly increase its storage capacity by inserting additional computers into its computing network. The data files were distributed over the large number of machines in the cluster. This distributed file system was called the Google File System, and was the precursor to HDFS.

Google would sort or index the data thus gathered so it could be searched efficiently. They invented the key-value NoSQL database architecture to store a variety of data objects. They developed the storage system to avoid updates in the same place. Thus the data was written once, and read multiple times.

Figure 3-0-1: Google Query Architecture

Google developed the MapReduce parallel processing architecture whereby large data sets could be processed by thousands of computers in parallel, with each computer processing a chunk of data, to produce quick results for the overall job.


The Hadoop ecosystem of data management tools, like the Hadoop Distributed File System (HDFS), the columnar database system HBase, a querying tool such as Hive, and more, emerged from Google's inventions. Storm is a streaming data technology that produces instant results. The Lambda Architecture is a Y-shaped architecture that branches out the incoming data stream for batch as well as stream processing.

Q1: Why would Google publish its File System and the MapReduce parallel programming system and release them into the open-source domain?

Q2: What else can be done with Google's repository of all the web's data?


Standard Big Data Architecture

Here is the generic Big Data Architecture introduced in Chapter 1. There are many sources of data. All data is funneled in through an ingest system. The data is forked into two sides: a stream processing system and a batch processing system. The outcomes of this processing can be sent into NoSQL databases for later retrieval, or sent directly for consumption by many applications and devices.

Figure 3-0-2: Big Data Application Architecture

A big data solution typically comprises these logical layers. Each layer can be represented by one or more available technologies.

Big data sources: The sources of data for an application depend upon what data is required to perform the kind of analyses you need. The various sources of Big Data were described in Chapter 2. The data will vary in origin, size, speed, form, and function, as described by the 4 Vs in Chapter 1. Data sources can be internal or external to the organization. The scope of access to the available data could be limited. The level of structure could be high or low. The speed of data and its quantity will also be high or low depending upon the data source.

Data ingest layer: This layer is responsible for acquiring data from the data sources. The data comes in through a scalable set of input points that can acquire data at various speeds and in various quantities. The data is sent to a batch processing system, a stream processing system, or directly to a storage file system (such as HDFS). Compliance regulations and governance policies impact what data can be stored and for how long.

Batch processing layer: This analysis layer receives data from the ingest point, or from the file system, or from the NoSQL databases. Data is processed using parallel programming techniques (such as MapReduce) to produce the desired results. This batch processing layer thus needs to understand the data sources and data types, the algorithms that would work on that data, and the format of the desired outcomes. The output of this layer could be sent for instant reporting, or stored in a NoSQL database for an on-demand report for the client.

Stream processing layer: This layer receives data directly from the ingest point. Data is processed in real time using parallel stream processing techniques (such as Spark or Storm) to produce the desired results. This layer thus needs to understand the data sources and data types extremely well, and the super-light algorithms that would work on that data to produce the desired results. The outcome of this layer too could be stored in the NoSQL databases.

Data organizing layer: This layer receives data from both the batch and stream processing layers. Its objective is to organize the data for easy access. It is represented by NoSQL databases. SQL-like languages like Hive and Pig can be used to easily access data and generate reports.

Data consumption layer: This layer consumes the output provided by the analysis layers, directly or through the organizing layer. The output could be standard reports, data analytics, dashboards and other visualization applications, or recommendation engines, delivered on mobile and other devices.

Infrastructure layer: At the bottom there is a layer that manages the raw resources of storage, compute, and communication. This is increasingly provided through a cloud computing paradigm.

Distributed file system layer: This layer includes the Hadoop Distributed File System (HDFS). It also includes supporting applications, such as YARN (Yet Another Resource Negotiator), that enable efficient access to data storage and its transfer.


Big Data Architecture Examples

Every major organization and application has a unique, optimized infrastructure to suit its specific needs. Below are some architecture examples from some very prominent users and designers of Big Data applications.

IBM Watson

IBM Watson uses Spark to manage incoming data streams. It also uses Spark's Machine Learning library (MLlib) to analyze data and predict diseases.

Netflix

This is one of the largest providers of online video entertainment. They handle 400 billion online events per day. As a cutting-edge user of big data technologies, they are constantly innovating their mix of technologies to deliver the best performance. Kafka is the common messaging system for all incoming requests. They host the entire infrastructure on Amazon Web Services (AWS). The data stores are AWS S3 as well as Cassandra and HBase. Spark is used for stream processing.


(Source: Netflix)

eBay

eBay is the second-largest ecommerce company in the world. It delivers 800 million listings from 25 million sellers to 160 million buyers. To manage this huge stream of activity, eBay uses a stack of Hadoop, Spark, Kafka, and other elements. They think that Kafka is the best new thing for processing data streams.

VMware

Here is VMware's view of a Big Data architecture. It is similar to, but more detailed than, our main big data architecture diagram.


The Weather Company

The Weather Company serves weather data globally through websites and mobile apps. It uses a streaming architecture based on Apache Spark.

Ticketmaster

This is the world's largest company that sells event tickets. Their goal is to make tickets available to purchase for real fans, and prevent bad actors from manipulating the system to increase the price of the tickets in the secondary markets.


LinkedIn

The goal of this professional networking company is to maintain an efficient system for processing the streaming data and make the link options available in real time.

PayPal

This payments-facilitation company needs to understand and acquire customers, and process a large number of payment transactions.


CERN

This premier high-energy physics research lab computes petabytes of data using in-memory stream processing to process data from millions of sensors and devices.


Conclusion

Big Data applications are architected to do stream as well as batch processing. Data is ingested and fed into streaming and batch processing. Most tools used for big data processing are open-source tools served through the Apache community and some key distributors of those technologies.


Review Questions

Q1: Describe the Big Data processing architecture.

Q2: What are Google's contributions to Big Data processing?

Q3: What are some of the hottest technologies visible in Big Data processing?


Liberty Stores Case Exercise: Step B3

The company wants to build a scalable and futuristic platform for its Big Data.

Q1: What kind of Big Data processing architecture would you suggest for this company?


Section 2

This section covers the important Big Data technologies defined in the Big Data architecture specified in Chapter 3.

Chapter 4 will cover Hadoop and its Distributed File System (HDFS).

Chapter 5 will cover the parallel processing algorithm, MapReduce.

Chapter 6 will cover NoSQL databases such as HBase and Cassandra. It will also cover the Pig and Hive languages used for accessing those databases.

Chapter 7 will cover Spark, a fast and integrated streaming data management platform.

Chapter 8 will cover data ingest systems, using Apache Kafka.

Chapter 9 will cover the Cloud Computing model.


Chapter 4: Distributed Computing using Hadoop

Introduction

A distributed system is a clever way of storing huge quantities of data, securely and cost-effectively, for speed and ease of retrieval and processing, using a networked collection of commodity machines. The ideal distributed file system would store infinite amounts of data while making the complexity completely transparent to the user, and enable easy access to the right data instantly. This would be achieved by storing fragments of data at different locations, and internally managing the lower-level tasks of storing and replicating data across the network. The distributed system ultimately leads to the creation of the unbounded cosmic computer that is aligned with the Unified Field of all the laws of nature.


Hadoop Framework

The Apache Hadoop distributed computing framework is composed of the following modules:

1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

2. Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;

3. YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications; and

4. MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

This chapter will cover Hadoop Common, HDFS, and YARN. The next chapter will cover MapReduce.


HDFS Design Goals

The Hadoop distributed file system (HDFS) is a distributed and scalable file system. It is designed for applications that deal with large data sizes. It is also designed to deal with mostly immutable files, i.e. write data once, but read it many times.

HDFS has the following major design goals:

1. Hardware failure management – it will happen, and one must plan for it.
2. Huge volume – create capacity for a large number of huge file sizes, with fast read/write throughput.
3. High speed – create a mechanism to provide low-latency access to streaming applications.
4. High variety – maintain simple data coherence, by writing data once but reading it many times.
5. Open source – maintain easy accessibility of data using any hardware, software, and database platform.
6. Network efficiency – minimize network bandwidth requirements, by minimizing data movement.


Master-Slave Architecture

Hadoop is an architecture for organizing computers in a master-slave relationship that helps achieve great scalability in processing. An HDFS cluster has two types of nodes operating in a master-worker pattern: a single master node (called the NameNode), and a large number of slave worker nodes (called DataNodes). A small Hadoop cluster includes a single master and multiple worker nodes. A large Hadoop cluster would consist of a master and thousands of small, ordinary machines as worker nodes.

Figure 4-0-1: Master-Slave Architecture

The master node manages the overall file system, its namespace, and controls the access to files by clients. The master node is aware of the DataNodes: i.e. what blocks of which file are stored on which DataNode. It also controls the processing plan for all applications running on the data on the cluster. There is only one master node. Unfortunately, that makes it a single point of failure. Therefore, whenever possible, the master node has a hot backup just in case the master node dies unexpectedly. The master node uses a transaction log to persistently record every change that occurs to file system metadata.

The worker nodes store the data blocks in their storage space, as directed by the master node. Each worker node typically contains many disks to maximize storage capacity and access speed. Each worker node has its own local file system. A worker node has no awareness of the distributed file structure. It simply stores each block of data as directed, as if each block were a separate file. The DataNodes store and serve up blocks of data over the network using a block protocol, under the direction of the NameNode.


Figure 4-0-2: Hadoop Architecture (Source: Hadoop.apache.org)

The NameNode stores all relevant information about all the DataNodes, and the files stored in those DataNodes. The NameNode will contain:

- For every DataNode: its name, rack, capacity, and health.

- For every file: its name, replicas, type, size, timestamp, location, health, etc.

If a DataNode fails, there is no serious problem. The data on the failed DataNode will be accessed from its replicas on other DataNodes. The failed DataNode can be automatically recreated on another machine, by writing all those file blocks off from the other healthy replicas. Each DataNode sends a heartbeat message to the NameNode periodically. Without this message, the DataNode is assumed to be dead. The DataNode replication effort would automatically kick in to replace the dead DataNode.

The file system has a set of features and capabilities to completely hide the splintering and scattering of data, and enable the user to deal with the data at a high, logical level.

The NameNode tries to ensure that files are evenly spread across the DataNodes in the cluster. That balances the storage and computing load, and also limits the extent of loss from the failure of a node. The NameNode also tries to optimize the networking load. When retrieving data or ordering the processing, the NameNode tries to pick fragments from multiple nodes to balance the processing load and speed up the total processing effort. The NameNode also tries to store fragments of a file on the same node for speed of reading and writing. Processing is done on the node where the file fragment is stored.

Any piece of data is typically stored on three nodes: two on the same rack, and one on a different rack. DataNodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.


Block System

HDFS stores large files (typically gigabytes to terabytes) by storing segments (called blocks) of the file across multiple machines. A block of data is the fundamental storage unit in HDFS. Data files are described, read, and written in block-sized granularity. All storage capacity and file sizes are measured in blocks. A block ranges from 16-128 MB in size, with a default block size of 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.

Every data file takes up a number of blocks depending upon its size. Thus a 100 MB file will occupy two blocks (100 MB divided by 64 MB, rounded up), with some room to spare. Every storage disk can accommodate a number of blocks depending upon the size of the disk. Thus a 1 Terabyte store will have about 16,000 blocks (more precisely, 1 TB divided by 64 MB is 16,384).
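As a minimal illustration of this arithmetic (assuming the default 64 MB block size, and the same file and disk sizes used above), the block counts can be computed with a ceiling division:

public class BlockMath {
  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;                       // default HDFS block size: 64 MB
    long fileSize = 100L * 1024 * 1024;                       // a 100 MB data file
    long fileBlocks = (fileSize + blockSize - 1) / blockSize; // ceiling division
    System.out.println(fileBlocks);                           // prints 2

    long diskSize = 1024L * 1024 * 1024 * 1024;               // a 1 TB storage disk
    System.out.println(diskSize / blockSize);                 // prints 16384
  }
}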

Every file is organized as a consecutively numbered sequence of blocks. A file's blocks are stored physically close to each other for ease of access, as far as possible. The file's block size and replication factor are configurable by the application that writes the file on HDFS.


Ensuring Data Integrity

Hadoop ensures that no data will be lost or corrupted during storage or processing. The files are written only once, and never updated in place. They can be read many times. Only one client can write or append to a file at a time. No concurrent updates are allowed.

If data is indeed lost or corrupted, or if a part of the disk gets corrupted, a new healthy replica for that lost block will be automatically recreated by copying from the replicas on other DataNodes. At least one of the replicas is stored on a DataNode on a different rack. This guards against the failure of the rack of nodes, or of the networking router on it.

A checksum algorithm is applied to all data written to HDFS. A process of serialization is used to turn files into a byte stream for transmission over a network or for writing to persistent storage. Hadoop has additional security built in, using a Kerberos verifier.


Installing HDFS

It is possible to run Hadoop on an in-house cluster of machines, or on the cloud, inexpensively. As an example, The New York Times used 100 Amazon Elastic Compute Cloud (EC2) instances (DataNodes) and a Hadoop application to process 4 TB of raw image TIFF data stored in Amazon Simple Storage Service (S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240 (not including bandwidth). See Chapter 9 for a primer on Cloud Computing. See Appendix 1 for a step-by-step tutorial on installing Hadoop on Amazon EC2.

Hadoop is written in Java, and requires a working Java installation. Installing Hadoop takes a lot of resources. For example, all information about fragments of files needs to be in the NameNode's memory. A thumb rule is that Hadoop needs approximately 1 GB of memory to manage 1 million file fragments. Many easy mechanisms exist to install the entire Hadoop stack. Using a GUI such as the Cloudera Resources Manager to install a Cloudera Hadoop stack is easy. This stack includes HDFS and many other related components, such as HBase, Pig, YARN, and more. Installing it on a cluster on a cloud services provider like AWS is easier than installing Java Virtual Machines (JVMs) on one's own hardware. HDFS can be installed by using the Cloudera GUI Resources Manager. If installing from the command line, download Hadoop from one of the Apache mirror sites.

Most access to files is provided through the Java abstract class org.apache.hadoop.fs.FileSystem. HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems. File access can be achieved through the native Java application programming interface (API). Another API, called Thrift, helps to generate a client in the language of the users' choosing (such as C++, Java, Python). When the Hadoop command is invoked with a class name as the first argument, it launches a Java virtual machine (JVM) to run the class, along with the relevant Hadoop libraries (and their dependencies) on the classpath.

HDFS has a UNIX-like command-line interface (CLI). Use the ssh shell to communicate with Hadoop. HDFS has a UNIX-like permissions model for files and directories. There are three progressively increasing levels of permissions: read (r), write (w), and execute (x). Create an hd user, and communicate using the ssh shell on the local machine.

% hadoop fs -help     ## get detailed help on every command

Reading and Writing Local Files into HDFS

There are two different ways to transfer data: from the local file system, or from an input/output stream. Copying a file from the local file system to HDFS can be done by:

% hadoop fs -copyFromLocal path/filename

Reading and Writing Data Streams into HDFS

Reading a file from HDFS by using a java.net.URL object to open a stream requires a short script, as below:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // details of processing the stream go here
} finally {
    IOUtils.closeStream(in);
}
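For this idiom to work, the Java runtime must first be told how to handle URLs with the hdfs:// scheme. A minimal self-contained sketch, modeled on the common URLCat pattern (the hdfs://host/path argument here is illustrative), would look like this:

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class HdfsUrlCat {
  static {
    // Register the hdfs:// scheme with java.net.URL; this can only be done once per JVM
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();             // e.g. hdfs://host/path/to/file
      IOUtils.copyBytes(in, System.out, 4096, false); // copy the stream to standard output
    } finally {
      IOUtils.closeStream(in);
    }
  }
}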

A simple method to create a new file is as follows:

public FSDataOutputStream create(Path p) throws IOException

Data can be appended to an existing file using the append() method:

public FSDataOutputStream append(Path p) throws IOException

A directory can be created by a simple method:

public boolean mkdirs(Path p) throws IOException

List the contents of a directory using:

public FileStatus[] listStatus(Path p) throws IOException

public FileStatus[] listStatus(Path p, PathFilter filter) throws IOException
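A minimal sketch that strings these FileSystem methods together is shown below (it assumes a configured Hadoop client on the classpath; the directory and file names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml settings
    FileSystem fs = FileSystem.get(conf);       // handle to the configured (distributed) file system

    Path dir = new Path("/user/hd/demo");       // illustrative directory
    fs.mkdirs(dir);                             // create the directory

    FSDataOutputStream out = fs.create(new Path(dir, "greeting.txt")); // create a new file
    out.writeUTF("Hello HDFS");                 // write some data
    out.close();

    for (FileStatus status : fs.listStatus(dir)) {   // list the directory contents
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}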


Sequence Files

The incoming data files can range from very small to extremely large, and with different structures. Big Data files are therefore organized quite differently to handle the diversity of file sizes and types. Large files are stored as HDFS files, with file fragments distributed across the cluster. However, smaller files should be bunched together into a single segment for efficient storage.

SequenceFiles are a specialized data structure within Hadoop to handle smaller files with smaller record sizes. A SequenceFile uses a persistent data structure for data available in key-value pair format. These help efficiently store smaller objects. HDFS and MapReduce are designed to work with large files, so packing small files into a SequenceFile container makes storing and processing the smaller files more efficient for HDFS and MapReduce.

Sequence files are row-oriented file formats, which means that the values for each row are stored contiguously in the file. This format is appropriate when a large number of columns of a single row are needed for processing at the same time. There are easy commands to create, read, and write SequenceFile structures. Sorting and merging SequenceFiles is native to the MapReduce system. A MapFile is essentially a sorted SequenceFile with an index to permit lookups by key.
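As a hedged sketch of packing many small records into one SequenceFile and reading them back (the path and the key/value classes are illustrative choices, and this uses one of the older, simpler createWriter signatures; newer Hadoop versions prefer an options-based variant):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/hd/messages.seq");   // illustrative path

    // Pack many small records into one container file as <IntWritable, Text> pairs
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
    for (int i = 0; i < 100; i++) {
      writer.append(new IntWritable(i), new Text("message number " + i));
    }
    writer.close();

    // Read the records back in the order they were written
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();
    Text value = new Text();
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}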


YARN

YARN (Yet Another Resource Negotiator) is the architectural center of Hadoop. It is often characterized as a large-scale, distributed operating system for big data applications. YARN manages resources and monitors workloads, in a secure multi-tenant environment, while ensuring high availability across multiple Hadoop clusters. YARN also brings great flexibility as a common platform to run multiple tools and applications, such as interactive SQL (e.g. Hive), real-time streaming (e.g. Spark), and batch processing (MapReduce), on data stored in a single HDFS storage platform. It brings clusters more scalability to expand beyond 1000 nodes, and it also improves cluster utilization through dynamic allocation of cluster resources to various applications.

Figure 4-0-3: Hadoop Distributed Architecture including YARN

The Resource Manager in YARN has two main components: the Scheduler and the Applications Manager.

The YARN Scheduler allocates resources to the various requesting applications. It does so based on an abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk storage, network, etc. Each machine also has a NodeManager that manages all the Containers on that machine, and reports status on resources and Containers to the YARN Scheduler.

The YARN Applications Manager accepts new job submissions from the client. It then requests a first resource Container for the application-specific ApplicationMaster program, and monitors the health and execution of the application. Once running, the ApplicationMaster directly negotiates additional resource Containers from the Scheduler as needed.


Conclusion

Hadoop is the major technology for managing big data. HDFS securely stores data on large clusters of commodity machines. A master machine controls the storage and processing activities of the worker machines. A NameNode controls the namespace and storage information for the file system on the DataNodes. A master JobTracker controls the processing of tasks at the DataNodes. YARN is the resource manager that manages all resources dynamically and efficiently across all applications on the cluster. The Hadoop file system and other parts of the Hadoop stack are distributed by many vendors, and can be easily installed on cloud computing infrastructure. A Hadoop installation tutorial is in Appendix A.


Review Questions

Q1: How does Hadoop differ from a traditional file system?

Q2: What are the design goals for HDFS?

Q3: How does HDFS ensure the security and integrity of data?

Q4: How does a master node differ from a worker node?


Chapter 5 – Parallel Processing with MapReduce

Introduction

A parallel processing system is a clever way to process huge amounts of data in a short period of time by enlisting the services of many computing devices to work on parts of the job simultaneously. The ideal parallel processing system will work across any computational problem, using any number of computing devices, across any size of data sets, with ease and high programmer productivity. This is achieved by framing the problem in a way that it can be broken down into many parts, such that each part can be partially processed independently of the other parts; and then the intermediate results from processing the parts can be combined to produce a final solution. Infinite parallel processing is the essence of the infinite dynamism of the laws of nature.


MapReduce Overview

MapReduce is a parallel programming framework for speeding up large-scale data processing for certain types of tasks. It achieves this with minimal movement of data on distributed file systems such as HDFS clusters, to achieve near-real-time results. There are two major prerequisites for MapReduce programming: (a) the application must lend itself to parallel programming; (b) the data can be expressed in key-value pairs.

MapReduce processing is similar to the UNIX command sequence (also called pipe) structure, e.g., the UNIX-style pipeline:

grep | sort | count    myfile.txt

will produce a word count of the text document called myfile.txt. (The count step here is conceptual; in standard UNIX it would be done with uniq -c.)

There are three commands in this sequence, and they work as follows: (a) grep is a command to read the text file and create an intermediate file with one word on a line; (b) the sort command will sort that intermediate file, and produce an alphabetically sorted list of words in that set; (c) the count command will work on that sorted list, to produce the number of occurrences of each word, and display the results to the user in a "word, frequency" pair format.

For example, suppose myfile.txt contains the following text:

My file: We are going to a picnic near our house. Many of our friends are coming. You are welcome to join us. We will have fun.

The outputs of Grep, Sort, and WordCount will be as shown below.

Grep       Sort       WordCount
We         a          a         1
are        are        are       3
going      are        coming    1
to         are        friends   1
a          coming     fun       1
picnic     friends    going     1
near       fun        have      1
our        going      house     1
house      have       join      1
Many       house      many      1
of         join       near      1
our        many       of        1
friends    near       our       2
are        of         picnic    1
coming     our        to        2
You        our        us        1
are        picnic     we        2
welcome    to         welcome   1
to         to         will      1
join       us         you       1
us         We
we         we
will       welcome
have       will
fun        you

If the file is very large, then it will take the computer a long time to process it. Parallel processing can help here.

MapReduce speeds up the computation by reading and processing small chunks of the file by different computers in parallel. Thus, if a file can be broken down into 100 small chunks, each chunk can be processed at a separate computer in parallel. The total time taken to process the file could be 1/100 of the time taken otherwise. However, now the results of the computation on small chunks are residing in 100 different places. This large number of partial results needs to be combined to produce a composite result. The results of the outputs from various chunks will be combined by another program called the Reduce program.

The Map step will distribute the full job into smaller tasks that can be done on separate computers, each using only a part of the data set. The result of the Map step will be considered as intermediate results. The Reduce step will read the intermediate results, and will combine all of them and produce the final result. The programmer needs to specify the functional logic for both the map and reduce steps. The sorting, between the Map and Reduce steps, does not need to be specified and is automatically taken care of by the MapReduce system as a standard service provided to every job. The sorting of the data requires a field to sort on. Thus the intermediate results need to have some kind of a key field, and a set of associated non-key attribute(s) for that key.

Figure 5-0-1: MapReduce Architecture

In practice, to manage the variety of data structures stored in the file system, data is stored as one key and one non-key attribute. Thus the data is represented as a key-value pair. The intermediate results, and the final results, will also all be in key-value pair format. Thus a key requirement for the use of the MapReduce parallel processing system is that the input data and output data must both be represented in key-value format.

The Map step reads data in key-value pair format. The programmer decides what the characteristics of the key and value fields should be. The Map step produces results in key-value pair format. However, the characteristics of the keys produced by the Map step, i.e. the intermediate results, need not be the same keys as in the input data. So, those can be called key2-value2 pairs.

The Reduce step reads the key2-value2 pairs, the intermediate results produced by the Map step. The Reduce step will produce an output using the same keys that it read. Only the values associated with those keys will change, though, as a result of processing. Thus it can be labeled as key2-value3 format.

Suppose the text in myfile.txt can be split into 4 approximately equal segments. It could be done with each sentence as a separate piece of text. The four segments will look as follows:

Segment 1: We are going to a picnic near our house.

Segment 2: Many of our friends are coming.

Segment 3: You are welcome to join us.

Segment 4: We will have fun.

Thus the input to the 4 processors in the Map step will be in key-value pair format. The first column is the key, which is the entire sentence in this case. The second column is the value, which in this application is the frequency of the sentence.

We are going to a picnic near our house.    1

Many of our friends are coming.    1

You are welcome to join us.    1

We will have fun.    1

This task can be done in parallel by four processors. Each of these segments will be the task for a different processor. Thus each task will produce a file of words, each with a count of 1. There will be four intermediate files, in <key, value> pair format, shown below.

Key2     Value2    Key2      Value2    Key2      Value2    Key2    Value2
we       1         many      1         you       1         we      1
are      1         of        1         are       1         will    1
going    1         our       1         welcome   1         have    1
to       1         friends   1         to        1         fun     1
a        1         are       1         join      1
picnic   1         coming    1         us        1
near     1
our      1
house    1

The sort process inherent within MapReduce will sort each of the intermediate files, and produce the following sorted key-value pairs:

Key2     Value2    Key2      Value2    Key2      Value2    Key2    Value2
a        1         are       1         are       1         fun     1
are      1         coming    1         join      1         have    1
going    1         friends   1         to        1         we      1
house    1         many      1         us        1         will    1
near     1         of        1         welcome   1
our      1         our       1         you       1
picnic   1
to       1
we       1

The Reduce function will read the sorted intermediate files, and combine the counts for all the unique words, to produce the following output. The keys remain the same as in the intermediate results. However, the values change as counts from each of the intermediate files are added up for each key. For example, the count for the word 'are' goes up to 3.

Key2       Value3
a          1
are        3
coming     1
friends    1
fun        1
going      1
have       1
house      1
join       1
many       1
near       1
of         1
our        2
picnic     1
to         2
us         1
we         2
welcome    1
will       1
you        1

This output will be identical to that produced by the UNIX sequence earlier.
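For readers who want to trace these steps concretely, here is a small self-contained Java simulation of the map, sort/shuffle, and reduce phases on the same four sentence segments (an in-memory sketch to illustrate the flow, not Hadoop code; punctuation and case are ignored for simplicity):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
  public static void main(String[] args) {
    String[] segments = {
      "We are going to a picnic near our house",
      "Many of our friends are coming",
      "You are welcome to join us",
      "We will have fun"
    };

    // Map phase: each segment independently emits (word, 1) pairs
    List<String[]> intermediate = new ArrayList<>();
    for (String segment : segments) {
      for (String word : segment.toLowerCase().split(" ")) {
        intermediate.add(new String[] { word, "1" });
      }
    }

    // Shuffle and sort phase: group the pairs by key (a TreeMap keeps the keys sorted)
    Map<String, List<String>> grouped = new TreeMap<>();
    for (String[] pair : intermediate) {
      grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
    }

    // Reduce phase: sum the counts for each key and emit the final (word, count) pairs
    for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
      int count = 0;
      for (String v : entry.getValue()) {
        count += Integer.parseInt(v);
      }
      System.out.println(entry.getKey() + "\t" + count);
    }
  }
}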


MapReduce Programming

A data processing problem needs to be transformed into the MapReduce model. The first step is to visualize the processing plan as a map and a reduce step. When the processing gets more complex, this complexity is generally manifested in having more MapReduce jobs, or more complex map and reduce jobs. Having more but simpler MapReduce jobs leads to more easily maintainable mapper and reducer programs.

MapReduce Data Types and Formats

MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3). Since Mapper and Reducer are separate classes, the type parameters have different scopes.

Hadoop can process many different types of data formats, from flat text files to databases. An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical: they may map to a full file, a part of a file, or a collection of files. In a database context, a split might correspond to a range of rows from a table, and a record to a row in that range.

Writing MapReduce Programs

Start by writing pseudocode for the map and reduce functions. The program code for both the map and the reduce functions can then be written in Java or other languages. In Java, the map function is represented by the generic Mapper class. It uses four parameters: input key, input value, output key, and output value. This class uses an abstract map() method. This method receives the input key and input value. It would normally produce an output key and output value. For more complex problems, it is better to use a higher-level language than MapReduce, such as Pig, Hive, Cascading, Crunch, or Spark.

A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (selecting the records of interest). The reducer typically combines (adds or averages) those values.

Figure 5-0-2: MapReduce Program Flow

Here below is the step-by-step logic. Imagine that we want to do a word count of all unique words in a text.

1. The big document is split into many segments. The map step is run on each segment of data. The output will be a set of key, value pairs. In this case, the key will be a word in the document.

2. The system will gather the key, value pair outputs from all the mappers, and will sort them by key. The sorted list itself may then be split into a few segments.

3. A Reducer task will read the sorted list and produce a combined list of word counts.

Here is the pseudocode for word count:

map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
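A corresponding Java implementation is sketched below. It follows the standard Hadoop word-count example (the class names, combiner choice, and command-line arguments are the conventional illustrative ones, not something specific to this book's case study):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job; args[0] is the input path, args[1] the output path
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the combiner reduces map output locally to save bandwidth
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}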

Testing MapReduce Programs

Mapper programs running on a cluster can be complicated to debug. The time-honored way of debugging programs is via print statements. However, with the programs eventually running on tens or thousands of nodes, it is best to debug the programs in stages. Therefore, run the program using small sample data sets to ensure that the program is working correctly. Expand the unit tests to cover larger data sets and run it on a cluster. Ensure that the mapper or reducer can handle the inputs correctly. Running against the full data set is likely to expose some more issues, which should be fixed by altering your mapper or reducer to handle the new cases. After the program is working, it may be tuned to make the entire MapReduce job run faster.

It may be desirable to split the logic into many simple mappers and chain them into a single mapper using a facility (the ChainMapper library class) built into Hadoop. It can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.


MapReduce Job Execution

A MapReduce job is specified by the Map program and the Reduce program, along with the data sets associated with that job. There is another master program that resides and runs endlessly on the NameNode. It is called the JobTracker, and it tracks the progress of the MapReduce jobs from beginning to completion. Hadoop divides the job into two kinds of tasks: map tasks and reduce tasks. Hadoop moves the Map and Reduce computation logic to each DataNode that is hosting a part of the data. The communication between the nodes is accomplished using YARN, Hadoop's native resource manager.

The master machine (NameNode) is completely aware of the data stored on each of the worker machines (DataNodes). It schedules the map or reduce jobs to task trackers with full awareness of the data location. For example: if node A contains data (x, y, z) and node B contains data (a, b, c), the job tracker schedules node B to perform map or reduce tasks on (a, b, c), and node A would be scheduled to perform map or reduce tasks on (x, y, z). This reduces the data traffic and prevents choking of the network.

Each DataNode has a worker program called the TaskTracker. This program monitors the execution of every task assigned to it by the NameNode. When a task is completed, the TaskTracker sends a completion message to the JobTracker program on the master node.

The jobs and tasks work in a master-slave mode.

Figure 5-0-3: Hierarchical Monitoring Architecture

When there is more than one job in a MapReduce workflow, it is necessary that they be executed in the right order. For a linear chain of jobs it might be easy. For a more complex directed acyclic graph (DAG) of jobs, there are libraries that can help orchestrate your workflow. Or one can use Apache Oozie, a system for running workflows of dependent jobs.

Oozie consists of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster.

The data set for the MapReduce job is divided into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The tasks are scheduled using YARN and run on nodes in the cluster. YARN ensures that if a task fails or is inordinately delayed, it will be automatically scheduled to run on a different node. The outputs of the map jobs are fed as input to the reduce job. That logic is also propagated to the node(s) that will do the reduce jobs. To save on bandwidth, Hadoop allows the use of a combiner function on the map output. Then the combiner function's output forms the input to the reduce function.

How MapReduce Works

A MapReduce job can be executed with a single method call: submit() on a Job object. When the resource manager receives a call to its submitApplication() method, it hands off the request to the YARN scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process. The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress. It retrieves the input splits computed in the client from the shared file system. It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given IDs at this point. The application master must decide how to run the tasks that make up the MapReduce job. The application master requests containers for all the map and reduce tasks in the job from the resource manager. Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager. The task is executed by a Java application whose main class is YarnChild.

Managing Failures

There can be failures at the level of the entire job or of particular tasks. The entire application master itself could fail.

Task failure usually happens when the user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error to its parent application master, where it is logged into error logs. The application master will then reschedule execution of the task on another DataNode.


The entire job, i.e. the MapReduce application master running on YARN, too can fail. In that case, it is started again, subject to a maximum number of attempts, which is a user-set configuration parameter.

If a node manager fails by crashing or running very slowly, it will stop sending heartbeats to the resource manager (or send them very infrequently). The resource manager will then remove it from its pool of nodes to schedule containers on. Any task or application master running on the failed node manager will be recovered using error logs, and started on other nodes.

The YARN Resource Manager can also fail, and that has more severe consequences for the entire cluster. Therefore, typically, there will be a hot standby for YARN. If the active resource manager fails, then the standby can take over without a significant interruption to the client. The new resource manager can read the application information from the state store, and then restart the applications that were running on the cluster.

Shuffle and Sort

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.

When the map function starts producing output, it is not directly written to disk. The map task takes advantage of buffering writes in memory and doing some presorting for efficiency reasons. Each map task has a circular memory buffer that it writes the output to. Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key. If there is a combiner function, it is run on the output of the sort so that there is less data to transfer to the reducer.

The reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts reading their outputs as soon as each completes. When all the map outputs have been read, the reduce task merges the map outputs, maintaining their sort ordering. The reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output file system, such as HDFS.

Progress and Status Updates

MapReduce jobs are long-running batch jobs, taking a long time to run. It is important for the user to get feedback on the job's progress. A job and each of its tasks have a status value (e.g., running, successfully completed, failed), the progress of maps and reduces, and the values of the job's counters. These values are constantly communicated back to the client. When the application master receives a notification that the last task for a job is complete, it changes the status for the job to "successful." Job statistics and counters are communicated to the user.

Hadoop comes with a native web-based GUI for tracking MapReduce jobs. It displays useful information about a job's progress, such as how many tasks have been completed, and which ones are still being executed. Once the job is completed, one can view the job statistics and logs.


Hadoop Streaming

Hadoop Streaming uses standard Unix streams as the interface between Hadoop and the user's program. Streaming is an ideal approach for text processing. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format (a tab-separated key-value pair) passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.


Conclusion

MapReduce is the first popular parallel programming framework for Big Data. It works well for applications where the data can be large, divisible into separate sets, and represented in <key, value> pair format. The application logic is divided into two parts: a Map program and a Reduce program. Each of these programs can be run in parallel by several machines.


Review Questions

Q1: What is MapReduce? What are its benefits?

Q2: What is the key-value pair format? How is it different from other data structures? What are its benefits and limitations?


Chapter 6 – NoSQL Databases

A NoSQL database is a clever way to cost-effectively organize large amounts of heterogeneous data for efficient access and updates. The ideal NoSQL database is completely aligned with the nature of the problems being solved, and is super fast at that task. This is achieved by releasing and relaxing many of the integrity and redundancy constraints of storing data in relational databases, and storing data in many innovative formats aligned with business needs. The diverse NoSQL databases will ultimately collectively evolve into a holistic set of efficient and elegant data structures at the heart of a cosmic computer of infinite organization capacity.


Introduction

Relational database management systems (RDBMS) are a powerful database technology used universally by almost all enterprises. Relational databases are structured and optimized to ensure accuracy and consistency of data, while also eliminating any redundancy of data. These databases are stored on the largest and most reliable of computers to ensure that the data is always available at a granular level and at a high speed.

Big data is, however, a much larger and unpredictable stream of data. Relational databases are inadequate for this task, and would also be very expensive for such large data volumes. Managing the costs and speed of handling such large and heterogeneous data streams requires relaxing many of the strict rules and requirements of relational data. Depending upon which constraint(s) are relaxed, a different kind of database structure will emerge. These are called NoSQL databases, to differentiate them from relational databases that use Structured Query Language (SQL) as the primary means to manipulate data.

NoSQL databases are next-generation databases that are non-relational in their design. The name NoSQL is meant to differentiate them from antiquated, 'pre-relational' databases. Today, almost every organization that needs to gather customer feedback and sentiments to improve their business will use a NoSQL database. NoSQL is useful when an enterprise needs to access, analyze, and utilize massive amounts of either structured or unstructured data, or data that is stored remotely in any virtual server across the globe.

The constraints of a relational database are relaxed in many ways. For example, relational databases require that any data element could be randomly accessed and its value could be updated in that same physical location. However, the simple physics of storage says that it is simpler and faster to read or write sequential blocks of data on a disk. Therefore, NoSQL database files are written once and almost never updated in place. If a new version of a part of the data becomes available, it would be stored elsewhere by the system. The system would have the intelligence to link the updated data to the old data.

Pig and Hive are two key and popular languages in the Hadoop ecosystem that work well on NoSQL databases. Pig originated at Yahoo, while Hive originated at Facebook. Both Pig and Hive can use the same data as an input, and can achieve similar results with queries. Both Pig Latin and Hive commands eventually compile into Map and Reduce jobs. They have a similar goal - to ease the complexity of writing complex Java MapReduce programs. Most MapReduce jobs can be implemented easily in Hive or Pig.


For analytical needs, Hive is preferable over Pig. For controlled processing, Pig's scripting design is preferable. Hive leads to ease and productivity with its SQL-like design and user interface. Pig offers greater control over data flows. Java MapReduce can be used with more advanced APIs to accomplish things when there is something special needed, such as interacting with a third-party tool, or some special data characteristics.


RDBMS vs NoSQL

They are different in many ways. First, NoSQL databases do not support relational schemas or the SQL language. The term NoSQL stands mostly for "Not only SQL". Second, their transaction processing capabilities are fast but weak, and they do not support the ACID (Atomicity, Consistency, Isolation, Durability) properties associated with transaction processing using relational databases. Instead, they are approximately accurate at any point in time, and will be eventually consistent. Third, these databases are also distributed and horizontally scalable to manage web-scale databases using Hadoop clusters of storage. Thus they work well with the write-once, read-many storage mechanism of Hadoop clusters.

Feature       | RDBMS                                          | NoSQL
Applications  | Mostly centralized applications (e.g. ERP)     | Mostly designed for decentralized applications (e.g. Web, mobile, sensors)
Availability  | Moderate to high                               | Continuous availability to receive and serve data
Velocity      | Moderate velocity of data                      | High velocity of data (devices, sensors, social media, etc.); low latency of access
Data Volume   | Moderate size; archived after a certain period | Huge volume of data, stored mostly for a long time or forever; linearly scalable DB
Data Sources  | Data arrives from one or a few, mostly predictable sources | Data arrives from multiple locations and is of an unpredictable nature
Data Type     | Data are mostly structured                     | Structured or unstructured data
Data Access   | Primary concern is reading the data            | Concern is both read and write
Technology    | Standardized relational schemas; SQL language  | Many designs, with many implementations of data structures and access languages
Cost          | Expensive; commercial                          | Low; open-source software


Types of NoSQL Databases
The variety of big data means that file sizes and types will vary enormously. There are specialized databases to suit different purposes.

1. Document Databases: Storing a 10 GB video movie file as a single object could be speeded up by sequentially storing the data in contiguous blocks of physical storage. An index could store the identifying information about the movie, and the address of the starting block. The rest of the storage details could be handled by the system. This storage format is called a document store format. The index would contain the name of the movie, and the value is the entire video file, characterized by the first block of storage. Document databases are generally useful for content management systems, blogging platforms, web analytics, real-time analytics, and e-commerce applications. We would avoid using document databases for systems that need complex transactions spanning multiple operations, or queries against varying aggregate structures.

2. Key-Value Pair Databases: There could be a collection of many data elements, such as a collection of text messages, which could also fit into a single physical block of storage. Each text message is a unique object. This data would need to be queried often. That collection of messages could be stored in a key-value pair format, by combining the identifier of the message and the content of the message (see the small illustration after this list). Key-value databases are useful for storing session information, user profiles, preferences, and shopping cart data. Key-value databases don't work so well when we need to query by non-key fields, or on multiple key fields at the same time.

3. Graph Databases: Geographic map data is stored as a set of relationships, or links, between points. Graph databases are very well suited to problem spaces where we have connected data, such as social networks, spatial data, routing information, and recommendation engines.

4. Columnar Databases: Some kinds of databases are needed to speed up oft-sought queries on very large data sets. Suppose there is an extremely large data warehouse of web log access data, which is rolled up by the number of web accesses per hour. This needs to be queried, or summarized, often, involving only some of the data fields from the database. Such a query could be speeded up by creating a database structure that includes only the relevant columns of the data set, along with the key identifying information. This is called a columnar database format, and is useful for content management systems, blogging platforms, maintaining counters, expiring usage, and heavy write volume such as log aggregation. Column family databases work well for systems where the query patterns have stabilized.
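As a small illustration of the key-value format, a single shopping cart entry might be stored under one key (the key name and JSON value here are hypothetical):

key:   cart:user42
value: {"items": [{"sku": "B0123", "qty": 2}], "updated": "2016-03-01"}

The database only ever stores, looks up, or deletes the whole value by its key; it does not interpret the contents of the value.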

The choice of NoSQL database depends on the system requirements. There are at least 200 implementations of NoSQL databases of these four types. Visit nosql-database.org for more.

Despite the name, a NoSQL database does not necessarily prohibit a structured query language. While some of the NoSQL systems are entirely non-relational, others just avoid some selected functionality of an RDBMS, such as fixed table schemas and join operations. For NoSQL systems, instead of using tables, the data can be organized in a key/value pair format, and then SQL can be used.

The first popular NoSQL database was HBase, which is a part of the Hadoop family. The most popular NoSQL database used today is Apache Cassandra, which was developed and owned by Facebook till it was released as open source in 2008. Other NoSQL database systems are SimpleDB, Google's BigTable, MemcacheDB, Oracle NoSQL, Voldemort, etc.


Architecture of NoSQL

Figure 6-1: NoSQL Databases Architecture

One of the key concepts underlying NoSQL databases is that database management has moved to a two-layer architecture, separating the concerns of data modeling and data storage. The data storage layer focuses on the task of high-performance, scalable data storage for the task at hand. The data management layer supports a variety of database formats, and allows for low-level access to that data through specialized languages that are more appropriate for the job, rather than being constrained by the standard SQL format.

NoSQL databases map the data into key/value pairs and save the data in the storage units. There is no storage of data in a centralized tabular form, so the database is highly scalable. The data could be of different forms, and coming from different sources, and it can all be stored in similar key/value pair formats.

There are a variety of NoSQL architectures. Some popular NoSQL databases like MongoDB are designed in a master/slave model, like many RDBMSs. But other popular NoSQL databases like Cassandra are designed in a master-less fashion where all the nodes in the cluster are the same. So, it is the architecture of the NoSQL database system that determines whether the benefits of a distributed and scalable system emerge, such as continuous availability, distributed access, high speed, and so on.

NoSQL databases provide developers a lot of options to choose from, and to fine-tune the system to their specific requirements. This requires understanding how the data is going to be consumed by the system: is it read-heavy vs. write-heavy, is there a need to query data with random query parameters, and will the system be able to handle inconsistent data?


CAP Theorem
Data is expected to be accurate and available. In a distributed environment, accuracy depends upon the consistency of data. A system is considered Consistent if all replicas of a data item contain the same value. The system is considered Available if the data is available at all points in time. It is also desirable for the data to be consistent and available even when a network failure renders the database partitioned into two or more islands. A system is considered Partition-tolerant if processing can continue in both partitions in the case of a network failure. In practice it is hard to achieve all three.

The choice between Consistency and Availability remains the unavoidable reality for distributed data stores. The CAP theorem states that in any distributed system one can choose only two out of the three (Consistency, Availability, and Partition tolerance). The third will be determined by those choices.

NoSQL databases can be tuned to suit one's choice of high consistency or availability. For example, for a NoSQL database, there are essentially three parameters:

- N = replication factor, i.e. the number of replicas created for each piece of data

- R = minimum number of nodes that should respond to a read request for it to be considered successful

- W = minimum number of nodes that should respond to a write request before it is considered successful.

Setting the values of R and W very high (R=N, and W=N) will make the system more consistent. However, it will be slow to report success, and thus Availability will be low. On the other end, setting R and W to be very low (such as R=1 and W=1) would make the cluster highly available, as even a single successful read (or write) would let the cluster report success. However, consistency of data on the cluster will be low, since many of the nodes may not yet have received the latest copy of the data.
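A quick worked example: with N=3 replicas, choosing R=2 and W=2 means every read quorum (2 nodes) must overlap every write quorum (2 nodes), since 2 + 2 > 3, so a read always sees the latest successful write. Choosing R=1 and W=1 gives 1 + 1 ≤ 3, so a read may land on the one replica that has not yet received the latest write. In general, setting R + W > N trades some availability and latency for stronger consistency.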

If a network gets partitioned because of a network failure, then one has to trade off availability versus consistency. NoSQL database users often choose availability and partition tolerance over strong consistency. They argue that short periods of application misbehavior are less problematic than short periods of unavailability.

Consistency is more expensive, in terms of throughput or latency, than Availability. However, HDFS chooses consistency – as three failed datanodes can potentially render a file's blocks completely unavailable.


Popular NoSQL Databases
We cover two of the more popular offerings.


HBase
Apache HBase is a column-oriented, non-relational, distributed database system that runs on top of HDFS. An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key; all access to HBase tables is done using the Primary Key. An HBase column represents an attribute of an object. For example, if the table is storing diagnostic logs from web servers, each row will be a log record. Each column in that table will represent an attribute such as the date/time of the record, or the server name. HBase permits many attributes to be grouped together into a column family, so that all elements of a column family are stored together as essentially a composite attribute.

Columnar databases are different from a relational database in terms of how the data is stored. In a relational database, all the columns/attributes of a given row are stored together. With HBase you must predefine the table schema and specify the column families. All rows of a column family will be stored sequentially. However, it is very flexible in that new columns can be added to families at any time, making the schema flexible and therefore able to adapt to changing application requirements.
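To make the table, row key, and column family ideas concrete, here is a minimal sketch using the standard HBase Java client API. The table name (weblogs), the column family (log), and the row key and values are hypothetical, and the table is assumed to already exist (created, for example, from the HBase shell).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WebLogHBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("weblogs"))) {   // hypothetical table

      // Write one log record: row key = timestamp + server, columns grouped in the 'log' family
      Put put = new Put(Bytes.toBytes("20160308-1015-web01"));
      put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("server"), Bytes.toBytes("web01"));
      put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
      table.put(put);

      // Read the record back by the same row key (all HBase access is by key)
      Result result = table.get(new Get(Bytes.toBytes("20160308-1015-web01")));
      byte[] url = result.getValue(Bytes.toBytes("log"), Bytes.toBytes("url"));
      System.out.println("url = " + Bytes.toString(url));
    }
  }
}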

Architecture Overview

HBase is built on a master-slave concept. In HBase a master node manages the cluster, while the worker nodes (called region servers) store portions of the tables and perform the work on the data. HBase is modeled after Google's Bigtable, and offers similar capabilities on top of Hadoop and HDFS. It does consistent reads and writes. It does automatic and configurable sharding of tables. A shard is a segment of the database.

Figure 6-2: HBase Architecture

Physically, HBase is composed of three types of servers in a master-slave type of architecture.


(a) The NameNode maintains metadata information for all the physical data blocks that comprise the files.

(b) Region servers serve data for reads and writes.

(c) The Hadoop DataNode stores the data that the RegionServer is managing.

HBase tables are divided horizontally by row key range into "Regions." A region contains all rows in the table between the region's start key and end key. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, which is part of the Hadoop ecosystem, maintains a live cluster state. There is automatic failover support between RegionServers. All HBase data is stored in HDFS files. RegionServers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the RegionServers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.

Each RegionServer creates an ephemeral node in ZooKeeper. The HMaster monitors these nodes to discover available region servers, and it also monitors these nodes for server failures.

A master is responsible for coordinating the region servers, including assigning regions on startup, load balancing and recovery among regions, and monitoring their health. It is also the interface for creating, deleting, and updating tables.

Reading and Writing Data

There is a special HBase catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.

This is what happens the first time a client reads or writes to HBase:

1. The client gets the Region server that hosts the META table from ZooKeeper.

2. The client then queries the .META. server to get the region server corresponding to the row key it wants to access. The client caches this information along with the META table location.

3. It gets the Row from the corresponding RegionServer.

4. For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it re-queries and updates the cache.


Cassandra
Apache Cassandra is a highly scalable open source non-relational database that offers continuous uptime, simplicity, and easy data distribution across multiple data centers and the cloud. Cassandra was originally developed at Facebook and was open sourced in 2008. It provides many benefits over traditional relational databases for modern online applications, such as a scalable architecture, continuous availability, high data protection, data replication across multiple data centers, data compression, a SQL-like language, and so on.

Architecture Overview

Cassandra's architecture provides its ability to scale and to provide continuous availability. Rather than using a master-slave architecture, it has a master-less "ring" design that is easy to set up and maintain. In Cassandra, all nodes play an equal role; all nodes communicate with one another through a distributed and highly scalable protocol called gossip.

So, the Cassandra scalable architecture provides the capacity to handle large volumes of data, and large numbers of concurrent users or operations occurring at the same time, across multiple data centers, just as easily as a normal operation for relational databases. To enhance its capacity, one simply needs to add new nodes to an existing cluster, without taking down the system or redesigning from scratch.

Also, the Cassandra architecture means that, unlike other master-slave systems, it has no single point of failure and thus is capable of offering continuous availability and uptime.

Reading and Writing Data


Data to be written to a Cassandra node is first recorded in an on-disk commit log, and then it is written to a memory-based structure called a "memtable". When a memtable's size exceeds a certain set threshold, the data is written to a file on disk called an "SSTable". Thus the write operation is fully sequential in nature, with many input/output operations occurring at the same time, rather than occurring one at a time over a long period.

For a read operation, Cassandra looks in an in-memory data structure called a "Bloom filter" that gives the probability of an SSTable having the required data. The Bloom filter can tell very quickly whether a file probably has the needed data or not. If it returns true, then Cassandra looks in another layer of in-memory caches, and then fetches the compressed data from disk. If the answer is false, Cassandra doesn't bother reading that SSTable and looks in another file for the required data.

Write Syntax (using the low-level Thrift client API; HOST, PORT, userIDKey, cp, clock, UTF8, and CL are assumed to be defined elsewhere):

TTransport tr = new TSocket(HOST, PORT);
TFramedTransport tf = new TFramedTransport(tr);
TProtocol protocol = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(protocol);
tf.open();
client.insert(userIDKey, cp,
    new Column("column-name".getBytes(UTF8), "column-data".getBytes(), clock), CL);

Read Syntax:

Column col = client.get(userIDKey, colPathName, CL).getColumn();
LOG.debug("Column name: " + new String(col.name, UTF8));
LOG.debug("Column value: " + new String(col.value, UTF8));


Hive Language
Hive is a declarative SQL-like language for queries. Hive was designed to appeal to a community comfortable with SQL. It is used mainly by data analysts on the server side, for designing reports. It has its own metadata store, which can be defined ahead of time, before data is loaded. Hive supports map and reduce transform scripts in the language of the user's choice, which can be embedded within SQL clauses. It is widely used in Facebook by analysts comfortable with SQL, as well as by data miners programming in Python. Hive is best used for traditional data warehousing tasks; it is not designed for online transaction processing.

Hive is best suited for structured data. Hive can be used to query data stored in HBase, which is a key-value store. Hive's SQL-like structure makes transformation of data to and from an RDBMS easier. Supporting SQL syntax also makes it easy to integrate with existing BI tools. Hive needs the data to be first imported (or loaded), and after that it can be worked upon. In the case of streaming data, one would have to keep filling buckets (or files), and then Hive can be used to process each filled bucket, while using other buckets to keep storing the newly arriving data.

Hive tables and columns are mapped to directories and files in HDFS. This mapping is stored in the metadata store. All HQL queries are converted to MapReduce jobs. A table can have one or more partition keys. There are the usual SQL data types, plus Arrays, Maps, and Structs to represent more complex types of data. There are also user-defined functions for mapping and aggregating data.

Figure 6-3: Hive Architecture

Hive Language Capabilities

Hive's SQL provides almost all basic SQL operations. These operations work on tables and/or partitions. These operations are: SELECT, FROM, WHERE, JOIN, GROUP BY, and ORDER BY. It also allows the results to be stored in another table, or in an HDFS file.

The statement to create a page_view table would look like this:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Here is a script for creating an external staging table that is used for loading data:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

The table created above can be stored in HDFS as a TextFile or as a SequenceFile.

Loading a file into the staging location and inserting it into the partitioned table looks like this:

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';


Pig Language
Pig is a high-level procedural language. It is used mainly for programming. It helps to create a step-by-step flow of data processing. It operates mostly on the client side of the cluster. Pig Latin follows a procedural programming model and is more natural to use for building a data pipeline, such as an ETL job. It gives full control over how the data flows through the pipeline, and when to checkpoint the data in the pipeline; it supports DAGs in the pipeline, such as splits, and gives more control over optimization. Pig works well with unstructured data. For complex operations, such as analyzing matrices or searching for patterns in unstructured data, Pig gives greater control and options.

Pig allows one to load data and user code at any point in the pipeline. This can be important for ingesting streaming data from satellites or instruments. Pig also uses lazy evaluation. Pig is faster in the data import but slower in actual execution than an RDBMS-friendly language like Hive. Pig is well suited to parallelization, and so it is better suited for very large data sets where throughput (amount of data processed) is more important than latency (speed of response).

Pig is SQL-like, but differs to a great extent. It does not have a dedicated metadata store; the schema has to be defined in the program itself. Pig can be easier for someone who has no earlier experience with SQL.


Conclusion
NoSQL databases emerged in response to the limitations of relational databases in handling the sheer volume, nature, and growth of data. NoSQL databases work closely with MapReduce-style processing. NoSQL databases are proving to be a viable solution to enterprise data needs, and will continue to do so. There are four types of NoSQL databases: columnar, key-value, document, and graph databases. Cassandra and HBase are among the most popular NoSQL databases. Hive is an SQL-like language to access data from NoSQL databases. Pig is a procedural high-level language that gives greater control over data flows.


Review Questions
Q1: What is a NoSQL database? What are the different types of NoSQL databases?

Q2: How does a NoSQL database leverage the power of MapReduce?

Q3: What are the kinds of NoSQL databases? What are the advantages of each?

Q4: What are the similarities and differences between Hive and Pig?


Chapter 7 – Stream Processing with Spark
A stream processing system is a clever way to process large quantities of data from a vast set of extremely fast incoming data streams. The ideal stream processing engine will capture and report in real time the essence of all data streams, no matter their speed or size. This is achieved by using innovative algorithms and filters that relax many computational accuracy requirements, to compute simple approximate metrics in real time. A stream processing engine aligns with the infinite dynamism of the flow of nature.


Introduction
Apache Spark is an integrated, fast, in-memory, general-purpose engine for large-scale data processing. Spark is ideal for iterative and interactive processing tasks on large data sets and streams. Spark achieves 10-100x performance over Hadoop by operating with an in-memory construct called 'Resilient Distributed Datasets', which helps avoid the latencies involved in disk reads and writes. While Spark is compatible with Hadoop file systems and tools, a large-scale adoption of Spark and its built-in libraries (for machine learning, graph processing, stream processing, and SQL) delivers seamless fast data processing along with high programmer productivity. Spark has become a more efficient and productive alternative to the Hadoop MapReduce ecosystem, and is increasingly being used in industry.

Apache Spark was originally developed in 2009 in UC Berkeley's AMPLab, and open sourced in 2010; it later became an Apache project. It can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), and NoSQL databases such as HBase and Cassandra. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory. Spark gives us a comprehensive, unified framework to manage big data processing requirements, with a variety of data sets that are diverse in nature (text data, graph data, etc.) as well as in the source of data (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory, and 10 times faster even when running on disk. Spark is an alternative to Hadoop MapReduce rather than a replacement for Hadoop. It provides a comprehensive and unified solution to manage different big data use cases and requirements.


Spark Architecture

The core Spark engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data, including a SQL query engine, a library of machine learning algorithms, a graph processing system, and streaming data processing software. Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data. Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It provides support for deploying Spark applications in an existing Hadoop v1 cluster (with SIMR – Spark Inside MapReduce), a Hadoop v2 YARN cluster, or even Apache Mesos.

Next we will introduce two important features in Spark: RDDs and the DAG.

Resilient Distributed Datasets (RDD)

RDDs, Resilient Distributed Datasets, are a distributed memory abstraction. They are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude.

RDDs are immutable, partitioned collections of records, which can only be created by coarse-grained operations such as map, filter, groupBy, etc. By coarse-grained operations, it is meant that the operations are applied to all elements in a data set. RDDs can only be created by reading data from stable storage such as HDFS, or by transformations on existing RDDs.

Once data is read into an RDD object in Spark, a variety of operations can be performed by calling abstract Spark APIs. The two major types of operation available are transformations and actions. Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union(). Actions return a value based on some computation being performed on an RDD. Some examples of actions supported by the Spark API include reduce(), count(), first(), and foreach().
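A minimal sketch of this transformation-then-action pattern, using Spark's Java API; the input path and the "ERROR" filter condition are hypothetical:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("RddExample").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");            // hypothetical input path
    JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));  // transformation: lazy, returns a new RDD
    long numErrors = errors.count();                                        // action: triggers the actual computation
    System.out.println("Error lines: " + numErrors);

    sc.stop();
  }
}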

Directed Acyclic Graph (DAG)

DAG refers to a directed acyclic graph. This approach is an important feature for real-time Big Data platforms. These tools, including Storm, Spark, and Tez, offer amazing new capabilities for building highly interactive, real-time computing systems to power real-time BI, predictive analytics, real-time marketing, and other critical systems.

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling, i.e. after an RDD action has been called, it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution. In general, DAGScheduler does three things in Spark: it computes an execution DAG, i.e. a DAG of stages, for a job; it determines the preferred locations to run each task on; and it handles failures due to shuffle output files being lost.


Spark Ecosystem
Spark is an integrated stack of tools responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Spark is written primarily in Scala, and provides APIs in Python, Java, R, and other languages. Spark comes with a set of integrated tools that reduce learning time and deliver higher user productivity. The Spark ecosystem includes the Mesos resource manager, and other tools.

Spark has already overtaken Hadoop in popularity because of the benefits it provides in terms of faster execution of iterative processing algorithms.


Spark for Big Data Processing
Spark supports big data mining through relevant libraries, including MLlib, GraphX, and SparkR, and through the Spark SQL language and the Spark Streaming library.

MLlib

MLlib is Spark's machine learning library. It consists of basic machine learning algorithms such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. At the same time, MLlib cares about algorithmic performance. Spark excels at iterative computation, enabling MLlib to run fast. So MLlib also contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce. In addition, Spark MLlib is easy to use, and it supports Scala, Java, Python, and R (through SparkR).

For example, decision trees are a popular data classification technique. Spark MLlib supports decision trees for binary and multiclass classification, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.

Functions in Decision Trees

Method signature: public static DecisionTreeModel trainClassifier(…)

Method to train a decision tree model for binary or multiclass classification.

Parameters:

• input - Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}.

• numClassesForClassification - number of classes for classification.

• categoricalFeaturesInfo - Map storing the arity of categorical features.

• impurity - Criterion used for information gain calculation. Supported values: "gini" or "entropy".

• maxDepth - Maximum depth of the tree (suggested value: 4).

• maxBins - Maximum number of bins used for splitting features (suggested value: 100).

Returns: DecisionTreeModel that can be used for prediction.
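A minimal usage sketch in Java, assuming the MLlib decision tree API described above; the input file path is hypothetical, and the training data is loaded in LibSVM format with MLUtils:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.util.MLUtils;

public class DecisionTreeExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("DecisionTreeExample").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load labeled training data in LibSVM format (path is hypothetical)
    JavaRDD<LabeledPoint> trainingData =
        MLUtils.loadLibSVMFile(sc.sc(), "data/sample_libsvm_data.txt").toJavaRDD();

    Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();  // empty: all features continuous
    DecisionTreeModel model = DecisionTree.trainClassifier(
        trainingData, 2, categoricalFeaturesInfo, "gini", 4, 100);    // numClasses=2, maxDepth=4, maxBins=100

    // Predict the label of the first training example and compare with the actual label
    LabeledPoint first = trainingData.first();
    System.out.println("predicted = " + model.predict(first.features())
        + ", actual = " + first.label());

    sc.stop();
  }
}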

Spark GraphX

Efficient processing of large graphs is another important and challenging issue. Many practical computing problems concern large graphs. For example, Google has to run its PageRank algorithm on billions of web pages and maybe trillions of web links. GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multi-graph with properties attached to each vertex and edge.

To support graph computation, GraphX exposes a set of fundamental operators such as subgraph, joinVertices, and aggregateMessages, on the basis of an optimized variant of the Pregel API (Pregel is the system at Google that powers PageRank). In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

We can compute the PageRank of each page as follows:

// Load the edges as a graph object
val graph = GraphLoader.edgeListFile(sc, "outlink.txt")

// Run PageRank
val ranks = graph.pageRank(0.00000001).vertices

// Join the ranks with the web pages
val pages = sc.textFile("pages.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}

val ranksByPagename = pages.join(ranks).map {
  case (id, (pagename, rank)) => (pagename, rank)
}

// Print the output
println(ranksByPagename.collect().mkString("\n"))

SparkR

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the runtime is single-threaded and can only process data sets that fit in a single machine's memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark; by using Spark's distributed computation engine it allows us to run large-scale data analysis from the R shell. SparkR exposes the RDD API of Spark as distributed lists in R. For example, one can read an input file from HDFS and process every line using lapply on an RDD. A small example follows:

sc <- sparkR.init("local")

lines <- textFile(sc, "hdfs://data.txt")

wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })

In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyWithPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey, and collect.

Spark SQL

Spark SQL is a language provided to deal with structured data. Using this, one can run queries on the data and get meaningful results. It supports queries through SQL as well as HQL (Hive Query Language), which is Apache Hive's version of SQL.

Spark Streaming

Spark Streaming receives data streams from input sources, processes them in a cluster, and pushes the results out to databases or dashboards. Spark chops up the data streams into batches of a few seconds. Spark treats each batch of data as an RDD and processes it using RDD operations. The processed results are pushed out in batches.
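A minimal sketch of this micro-batch model, using the Java Streaming API; the socket source on localhost port 9999 and the "ERROR" filter are hypothetical:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingErrorCount {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("StreamingErrorCount").setMaster("local[2]");
    // Incoming data is chopped into 5-second micro-batches; each batch is processed as an RDD
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    // Hypothetical source: lines of text arriving on a local TCP socket
    JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
    JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));  // per-batch transformation
    errors.count().print();        // output operation: print the per-batch count of error lines

    jssc.start();                  // start receiving and processing
    jssc.awaitTermination();
  }
}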


Spark Applications
Some hot data problems that are solved well by a tool like Apache Spark include: 1. Real-time log data monitoring. 2. Massive natural language processing. 3. Large-scale online recommendation systems.

A simple word count application can be run in the Spark shell as below.

val textFile = sc.textFile("C:/Users/MyName/Documents/obamaSpeech.txt")

*** Comment: reads the text file into an RDD named textFile ***

val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

*** Comment: calculates the count of each word by splitting on spaces ***

counts.count()

*** Returns the output as below ***

Long = 52

counts.saveAsTextFile("C:/Users/MyName/Desktop/counts1")

*** Comment: saves the output file on my Desktop ***


Spark vs Hadoop
Spark and Hadoop are both popular Apache projects dedicated to big data processing. Hadoop, for many years, was the leading open source big data platform, and many companies already use a distributed computing framework like Hadoop based on MapReduce. The table below provides a summary of the differences between Hadoop and Spark.

Feature: Hadoop vs. Spark

Purpose. Hadoop: resilient, cost-effective storage and processing of large data sets. Spark: fast general-purpose engine for large-scale data processing.

Core component. Hadoop: Hadoop Distributed File System (HDFS). Spark: Spark Core, the in-memory processing engine.

Storage. Hadoop: HDFS manages massive data collections across multiple nodes within a cluster of commodity servers. Spark: Spark does not do distributed storage; it operates on distributed data collections.

Fault tolerance. Hadoop: uses replication to achieve fault tolerance. Spark: uses RDDs for fault tolerance, which minimizes network I/O.

Nature of processing. Hadoop: accompanied by MapReduce, it does batch processing of data in parallel mode. Spark: batch as well as stream processing.

Sweet spot. Hadoop: batch processing. Spark: iterative and interactive processing jobs that can fit in memory.

Processing speed. Hadoop: MapReduce is slow. Spark: can be up to 10x faster than MapReduce for batch processing and up to 100x faster for stream processing.

Security. Hadoop: more secure. Spark: less secure.

Failure recovery. Hadoop: can recover from system faults or failures, since data is written to disk after every operation. Spark: data objects are stored in RDDs, which can be reconstructed after faults or failures.

Analytics tools. Hadoop: separate engines. Spark: built-in MLlib (machine learning) and GraphX (graph processing) libraries.

Compatibility. Hadoop: primary storage model is HDFS. Spark: compatible with HDFS and other storage formats.

Language support. Hadoop: Java. Spark: Scala is the native language; APIs for Python, Java, R, and others.

Driving organization. Hadoop: Yahoo. Spark: AMPLab at UC Berkeley.

Technology owners. Hadoop: Apache, open-source, free. Spark: open-source, free.

Key distributors. Hadoop: Cloudera, Hortonworks, MapR. Spark: Databricks, AMPLab.

Cost of system. Hadoop: medium to high. Spark: medium to high.


Conclusion
Spark is a new integrated system for big data processing. Its most important core abstraction is the RDD, along with relevant libraries like MLlib and GraphX. Spark is a really powerful open source processing engine built around speed, ease of use, and sophisticated analytics.


Review Questions
Q1: Describe the Spark ecosystem.

Q2: Compare Spark and Hadoop in terms of their ability to do stream computing.

Q3: What is an RDD? How does it make Spark faster?

Q4: Describe three major capabilities in Spark for data analytics.


Chapter 8 – Ingesting Data Wholeness
A data ingesting system is a reliable and efficient point of reception for all data coming into a system. This system is designed to be flexible and scalable, to receive data from various sources, at various times, speeds, and quantities. The ingest system makes the data available for use by the target applications in real time. Ideally, all data would be smoothly received, and made available for downstream applications to securely and reliably access at their own convenience. A dedicated data ingest mechanism is achieved by creating a fast and flexible buffer for receiving and storing all incoming streams of data. The data in the buffer is stored in a sequential manner, and is made available to all consuming applications in a fast and orderly manner.

Big data arrives into a system at unpredictable speeds and quantities. Business applications thereafter receive and process this data at some planned throughput capacity. An ingest buffer is needed to communicate the data without loss of data or speed. This buffer idea has historically been called a messaging system, not too dissimilar from a mailbox system at the post office. Incoming messages are put into a set of organized locations, from where the target applications will receive them when they are ready.

With huge amounts of data coming in from different sources, and many more consuming applications, a point-to-point system of delivering messages becomes inadequate and slow. Alternatively, incoming data can be categorized into certain topics, and stored in the respective location or locations for those topics. Instead of data being received and held in storage for a specific target application, now the data may be consumed by any application that is interested in data related to a topic. Each consuming application can choose to read data about one or more topics of its interest. This is called the publish-and-subscribe system.


Messaging Systems
A messaging system is an asynchronous mode of communicating data between applications. There are two generic kinds of messaging systems − a point-to-point system, and a publish-subscribe (pub-sub) system. Most messaging patterns now follow the pub-sub model.

Point-to-Point Messaging System

In a point-to-point system, every message is directed at a particular receiver. A common queue can receive messages from many producers. Any particular message can be received and consumed by only one receiver. Once that target consumer reads a message in the queue, that message disappears from that queue. The typical example of this system is an Order Processing System, where each order will be processed by one Order Processor.

Publish-Subscribe Messaging System

In a pub-sub messaging system, the applications publish their output to a standard messaging queue. The target recipient only needs to know where to get the message, whenever it is ready to pick up the message. Applications thus can ignore the mechanics of interaction with other applications, and simply care about the message itself. This is especially valuable when there may be many target recipients for a message. In a pub-sub system, messages are entered into the messaging queue asynchronously from client applications.

A message queuing system needs to be fast and secure to serve many applications, both producers and subscribers. Messages are also replicated across multiple locations for reliability of data.

There are two popular data ingesting systems used in Big Data. An older system, called Flume, is closely tied to the Hadoop distributed file system. The newer and more popular system is a general-purpose system called Apache Kafka. In this chapter we will discuss the newer system, Kafka.


Apache Kafka
Apache Kafka is an open source publish-and-subscribe message broker system. Kafka aims to provide an integrated, high-throughput, low-latency messaging platform for handling real-time data feeds. In the abstract, it is a single point of contact between all producers and consumers of data. All producers of data send data to Kafka. All consumers of data read data from Kafka. (Figure 8.1)

Figure 8-1: Kafka core idea

Kafka is a distributed, partitioned, scalable, replicated messaging system, with a simple but unique design. It was initially developed by LinkedIn and was open sourced in early 2011. The Apache Software Foundation is now responsible for its development and improvement. Kafka is valuable for enterprise-level infrastructure because of its simplicity and scalability. The Kafka system is written in the high-level Scala programming language.


Use Cases
Following are some popular use cases of Apache Kafka.

Messaging

Kafka is a very good alternative to a traditional message broker because the Kafka messaging system has better throughput, built-in partitioning, replication, and better fault tolerance. Kafka is a very good solution for large-scale message processing applications.

Website Activity Tracking

Website activity tracking was one of the initial use cases for Kafka at LinkedIn. Users' online activity tracking pipeline was rebuilt as a set of real-time data feeds. General web activity tracking involves a very large volume of data, and Kafka is very good at handling this huge volume of data. User activity types such as page views, searches, clicks, etc. can be designated as central topics, and the activity data can be published to those topics. Those events are then available for real-time or offline processing and reporting.

Stream Processing

Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and send it to other users and consumer applications. They may even write it back to Kafka to a new topic. Kafka's strong durability is also very useful for stream processing.

Log Aggregation

Activity log aggregation typically gathers physical log files from servers and puts them all in a central place for processing. Kafka can abstract away the details of the files and provide a cleaner abstraction of log data as a stream of messages. Use of Kafka then allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. Unlike dedicated log-centric systems, Kafka offers higher performance and stronger durability guarantees due to replication.

Commit Log

Kafka can be used as an external commit log for a distributed database system. This audit log can help to re-sync data between failed nodes to restore their data. The log compaction feature in Kafka helps to achieve this more efficiently.


Kafka Architecture
In the abstract, Kafka brokers deal with producers and consumers of data. A producer pushes data into the ingest system at its own speed, scale, and convenience. A consumer pulls data out of the system at its own speed, scale, and convenience. All the received data is organized by categories, called topics. Incoming data is sorted and stored into topic servers. The consumers of data can subscribe to one or more topics (Figure 8.2).

Figure 8-2: Kafka Ecosystem

There is more than one broker (also called servers, or partitions) for each topic, for reliability of the messaging system. Thus two or more brokers will store the data on each topic. Only one broker can be the leader at any given time. If the lead broker fails, a second one can automatically take over and prevent the loss of access to data.

Kafka is designed for distributed high-throughput systems. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message processing applications. It has the ability to handle a large number of diverse consumers. It integrates very well with Apache Storm, Spark, and other real-time streaming data applications. Kafka is very fast and can perform 2 million writes per second. It also guarantees zero downtime and zero data loss.

There are a lot of contributing organizations helping to improve the Kafka open-source system. It has very well-documented online resources. It has been used by many big organizations such as LinkedIn, Cisco Systems, Spotify, PayPal, HubSpot, Shopify, Uber, and more. HubSpot uses Kafka to deliver real-time notification of when a recipient opens their email. PayPal uses Kafka to process millions of updates in a minute.

Producers

A producer is responsible for selecting the partition, and the topic, for the message that it wants to convey. It can use a round-robin algorithm to balance the load among partitions. There can be both synchronous and asynchronous producers for producing messages and publishing them to the partition.

Consumers

A consumer is responsible for reading the data on the topics that it has subscribed to. The consumer is responsible for reading the data within a reasonable period of time, before the queues are emptied for efficient management of storage. Different consuming applications can read the data at different times. Kafka has stronger ordering guarantees than a traditional messaging system. A consumer needs to know how far it has read in the queue, so as to avoid duplicates or losing some data.

Broker

A broker is a server in a Kafka cluster. The cluster may have many such servers or brokers.

Topic

A topic is a category into which messages are published. For each topic there is a separate partition log for the storage of messages. Each partition has an ordered sequence of messages for that topic. Each message in the partition is assigned a unique sequential number, also called the offset. This offset helps to identify each message within the partition.

The consumer reads the data sequentially according to the offset numbers. The consumer maintains the offset to remember how far it has read. Generally, the offset increases linearly as messages are consumed. However, a consumer can reset the offset to access the data again and reprocess it as needed.

The Kafka cluster keeps all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the log retention is set to seven days, then for the seven days after publishing, the message is available for consumption. After seven days, Kafka discards the message to free up space.

Kafka's performance is not affected by the size of the data. Each partition must fit on the servers that host it, but a topic may have multiple partitions. This enables Kafka to manage an arbitrary amount of data. A partition also acts as the unit of parallelism.

Summary of Key Attributes

1. Disk based: Kafka works on a cluster of disks. It does not keep everything in memory, and keeps writing to the disk to make the storage permanent.

2. Fault tolerant: Data in Kafka is replicated across multiple brokers. When any leader broker fails, a follower broker takes over as leader and everything continues to work normally.

3. Scalable: Kafka can scale up easily by adding more partitions or more brokers. More brokers help to spread the load, and this provides greater throughput.

4. Low latency: Kafka does very little processing on the data. Thus it has a very low latency rate: messages produced by the producer are published and available to the consumer within a few milliseconds.

5. Finite retention: Kafka by default keeps the messages in the cluster for a week. After that the storage is refreshed. Thus the data consumers have up to a week to catch up on data, in case they fall behind for any reason.

Distribution

The Kafka cluster maintains multiple servers over the distributed network. The partitions of the log are maintained over this network. Each server handles data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. One of the servers for each partition acts as the main server, also called the "leader", while there may be one or more secondary servers, known as "followers". The leader server is responsible for handling all the read and write operations for the partition, while the followers silently replicate the leader. The follower servers become very helpful when the leader server fails: a follower automatically becomes the leader and handles the failure. One server can be a leader for some of the partitions on it, while being a follower for other partitions. Thus one server can act as both leader and follower. This helps to balance the workload on the servers within the cluster.

Guarantees

Messages always maintain the order in which they were sent. For example, if messages M1 and M2 were sent by the same producer and M1 was sent first, then message M1 will have a lower offset than message M2. Therefore, M1 will always appear before M2 for the consumer.

Each topic has a replication factor N, and the system can tolerate up to N-1 server failures without losing any messages committed to the log.

Client Libraries

Kafka supports the following client libraries:

1. Python: Pure Python implementation with full protocol support; Consumer and Producer are also included.

2. C: High-performance C library with full protocol support.

3. C++, Ruby, JavaScript, and more.


Apache ZooKeeper
Kafka is built on top of ZooKeeper. Apache ZooKeeper is a distributed configuration and synchronization service used in Hadoop clusters. Here it serves as the coordination interface between the Kafka brokers and consumers. The Kafka servers store basic metadata in ZooKeeper and share information about topics, brokers, consumer offsets (queue readers), and so on.

Since ZooKeeper does its own layers of replication, the failure of a Kafka broker does not affect the state of the Kafka cluster. Even if ZooKeeper fails, Kafka will restore the state once ZooKeeper restarts. This gives zero downtime for Kafka. ZooKeeper also manages the alternative leader broker selection, in case of a Kafka leader failure.

Kafka Producer example in Java

// Configure the producer (serializer settings are required; byte[] keys and values are used here)
Properties config = new Properties();
config.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:8082");
config.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArraySerializer");
config.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArraySerializer");

KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(config);

// Publish one record to the "topic" topic
ProducerRecord<byte[], byte[]> record = new ProducerRecord<>("topic", "key".getBytes(), "value".getBytes());
Future<RecordMetadata> response = producer.send(record);
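On the receiving side, a Kafka Consumer example in Java might look like the following minimal sketch; the broker address, consumer group id, and topic name are hypothetical, and the classes come from the org.apache.kafka.clients.consumer package (plus java.util.Properties and java.util.Arrays):

Properties config = new Properties();
config.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:8082");
config.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
config.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
config.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(config);
consumer.subscribe(Arrays.asList("topic"));      // subscribe to one or more topics

while (true) {
    // Poll the brokers for the next batch of records on the subscribed topics
    ConsumerRecords<byte[], byte[]> records = consumer.poll(100);
    for (ConsumerRecord<byte[], byte[]> record : records) {
        System.out.println("offset = " + record.offset()
            + ", value = " + new String(record.value()));
    }
}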


Conclusion
Big data is ingested using a dedicated system. These often take the form of messaging systems. Publish-and-subscribe systems are efficient ways of delivering data from many sources to many targets, in a reliable, secure, and efficient way. Kafka is an open-source, reliable, secure, and scalable publish-subscribe messaging system. It deals with producers as well as consumers of data. Messages are published to a set of central topics. Each consumer can subscribe to any number of topics. Kafka uses a leader-follower system of managing replicated partitions for the same set of data, to ensure full reliability and zero downtime.


Review Questions
Q1: What is a data ingest system? Why is it an important topic?

Q2: What are the two ways of delivering data from many sources to many targets?

Q3: What is Kafka? What are its advantages? Describe 3 use cases of Kafka.

Q4: What is a topic? How does it help with data ingest management?


References
1. http://kafka.apache.org/documentation.html#introduction


Chapter 9 – Cloud Computing Primer
Cloud computing is a cost-effective and flexible mode of delivering IT infrastructure as a service to clients, over the internet, on a metered basis. The cloud computing model offers clients enormous flexibility to use as much IT capacity – compute, storage, network – as needed, without having to invest in dedicated IT capacity of one's own. The IT usage can be scaled up or down in minutes. The complex IT infrastructure management skills are all owned by the cloud computing provider, and problems can be resolved much faster. The client can simply access a smoothly running IT infrastructure over a fast internet connection. IT capacity in the cloud can be purchased as a custom package, depending upon one's needs in terms of average and peak IT requirements. The computing cloud is the ultimate cosmic computer aligned with all laws of nature.


Introduction
Managing very large and fast data streams is a huge challenge. It requires making critical decisions about storage, structure, and access. This data would be stored in large clusters of hundreds or thousands of inexpensive computers. Such clusters are often called server farms. The location and size of such clusters impact costs. The server farms may be located in one's own data centers, or they may be rented from specialized third-party organizations called cloud computing service providers.

Cloud computing provides IT leadership a cost-effective and predictable solution for reliably meeting their large data management needs. There are many vendors offering this service. Prices keep dropping regularly, because IT components keep getting cheaper, there is a growing volume of business, and there is effective competition. With cloud computing, the IT expense becomes an operating expense rather than a capital expense. The cost of IT becomes aligned with revenue streams and makes cash flow management easier.

One of the main reasons for enterprises moving to cloud computing is to experiment with new and risky projects. This flexible model makes it much easier to launch new products and services, without being exposed to the risk of a heavy loss on IT infrastructure. For example, a new Hollywood movie's site will have millions of visitors to its website for a month before and a month after the movie's release date. After that, the visits to the website will drop dramatically. The website owner would benefit enormously from using a cloud computing model where they pay for the peak web usage capacity for those few months, and much less as the usage drops down. More importantly, the flexibility ensures that their website will not crash just in case the movie becomes a super-hit and attracts an unusually large number of visitors to the website.


Cloud Computing Characteristics
Here are the major characteristics of a cloud computing model.

1. Flexible Capacity: The capacity can scale up rapidly. One can expand and reduce resources according to one's specific service requirements, as and when needed. The cloud internally does regular workload balancing among the needs of millions of clients, and this helps bring down costs for everyone.

2. Attractive payment model: Cloud computing works on a pay-per-use model, i.e. one pays only for what one uses, and for how long one uses it. IT costs become an operating expense rather than a capital expense for the client. The resource prices may be negotiated at long-term contract rates, and can also be purchased at spot market rates.

3. Resiliency and Security: The failure of any individual server or storage resource does not impact the user. The servers and storage for all clients are isolated to maximize the security of data.


In-house Storage
Most organizations have data centers for running their regular IT operations. An organization may decide to expand its own data center to store large streams of data. The organization can ensure complete security and privacy of its data if it keeps all the data in-house. However, the costs and complexity of managing this data are increasing, and it is not cost-effective for every organization to manage huge data centers. Hiring and retaining the scarce advanced skills needed to manage such data centers would also be a challenge.


Cloud Storage
It is now becoming a trend for organizations to choose to store their data in massive data centers owned by other, specialized companies. Their data and processing capacity reside in some sort of a huge cloud out there, which is accessible from anywhere, anytime, through a simple internet connection.

Companies like Amazon, Google, Microsoft, Apple, and IBM are among the major providers of cloud storage and computing services around the world. They own and operate data centers with millions of computers in them.

Figure 9-1: A cloud computing data center

Commercially, cloud service providers are able to consolidate the requirements of thousands or millions of customers, and supply flexible amounts of data storage and computing facility to clients on a per-usage basis. This pay model is similar to how electric utility companies charge consumers for their usage of electricity in homes and offices. Cloud computing offers much lower costs per use, just like using the electric utility costs much less than owning and operating one's own electricity generators.


A major disadvantage of cloud storage is that the data is stored away from one's physical control. The security of precious data is thus left in the hands of the cloud computing provider. While security protocols are rapidly improving, there are no failsafe methods for securing data in the cloud. There is also a risk of being locked into one provider's infrastructure. Even so, the cost-benefit tradeoffs have definitely tilted towards using cloud computing providers. At some future point in time, cloud service providers might be heavily regulated, like the electric utilities.


Cloud Computing: Evolution of Virtualized Architecture

Cloud computing is essentially a commercial model for virtualized server infrastructure. IBM began to offer time-sharing services on its mainframe computers in the 1960s. Now the same idea is offered on networks of small machines through virtualization.

Virtualization separates logical machines from physical machines. A physical server can run multiple Virtual Machines (VMs), and one virtual machine may span multiple physical servers. The virtualization software is called a hypervisor. It abstracts all machines into virtual machines, typically managed through an easy GUI interface. Virtualization software can run on a heterogeneous physical infrastructure and convert all IT capacity into a single unified pool. This capacity can then be provisioned in slices and packages. User applications are not aware that they are running in a virtualized environment; they run as if on a dedicated machine. The applications can also run on top of their own native operating systems.


Cloud Service Models

There are two major dimensions along which to conceptualize cloud computing models: the scope of services received, and the control over and cost of those services.

1. The range of cloud computing services from a cloud computing provider falls into three broad buckets:

   1. Infrastructure as a Service (IaaS): This is the lowest level of service, and includes only raw capacity of compute, storage, and networking. The price for this service is the lowest.

   2. Platform as a Service (PaaS): This includes IaaS, along with other technologies and services. These are still fairly general tools, such as an open-source Hadoop, Spark, or Cassandra implementation, along with certain monitoring tools. The costs are a little higher because of the additional management and monitoring services provided by the provider.

   3. Software as a Service (SaaS): This includes the computing platform as well as the business applications that get work done. For example, salesforce.com was one of the first CRM applications sold only on a SaaS model. Google sells an email service to organizations on a per-user-per-month basis. This is also the most expensive type of cloud service.

2. The other way cloud services differ is in terms of ownership and control.

   1. Public cloud: This is a large shared infrastructure made available to one and all, in a low-cost, multi-tenancy model. The client can access it using any device. The downside is that the data also resides on the cloud, and thus could be vulnerable to theft or hacking. The costs to the client are low, and variable depending upon use.

   2. Private cloud: This is a cloud version of an in-house IT infrastructure. The organization has exclusive control over the entire infrastructure. The costs are fixed and higher.

   3. Hybrid cloud: This is a mix of flexible capacity and control over key aspects of the infrastructure. One can retain complete control over critical applications, while using shared infrastructure for non-critical applications.

All levels of infrastructure and pay models are useful, as they serve different levels of needs for client organizations. However, most of the growth in cloud computing is happening because of the attractiveness of the low cost of the public cloud model.


Cloud Computing Myths

There are a couple of common misconceptions about the costs and benefits of cloud computing.

1. Myth: Public cloud computing satisfies all the requirements: scalability, flexibility, pay per use, resilience, multi-tenancy, and security. In reality, depending upon the type of service selected (IaaS, PaaS, or SaaS), the service satisfies only specific subsets of these requirements.

2. Myth: Cloud computing is useful only if you are outsourcing your IT functions to an external service provider. In reality, one can use a private cloud computing model for a section of IT applications to offer on-demand, scalable, and pay-per-use deployments within the enterprise's own data center.


Cloud Computing: Getting Started

Here is a simple framework for cloud adoption. Learn about the context for obtaining benefits from cloud computing. Select the right model and level of cloud capacity. Set up the applications, and a monitoring system for those applications and the total cloud footprint. Choose a service provider, say Amazon Web Services (AWS), the leading provider of cloud computing. Use Appendix 1 to install Hadoop on AWS EC2 public cloud infrastructure.


Conclusion

Cloud computing is a business model for providing shared, flexible, cost-effective IT infrastructure to get started quickly on building an application. For Big Data applications, it can be especially attractive to test the system using rented facilities before deciding whether to invest in dedicated IT infrastructure.


Review Questions

Q1: Describe the cloud computing model.

Q2: What are the advantages of cloud computing over in-house computing?

Q3: Describe the technical architecture for cloud computing.

Q4: Name a few major providers of cloud computing services.


Section 3

This section covers the other relevant concepts and tutorials for effectively managing and utilizing Big Data.

Chapter 10 brings all the tools together in a case study of developing a web log analyzer, as an example of a useful Big Data application.

Chapter 11 covers an overall view of Data Mining tools and techniques to extract benefit from Big Data.

Appendix 1 shows, step by step, how to install a Hadoop cluster on a cloud computing platform.

Appendix 2 is a tutorial on installing and running Spark.


Chapter 10 – Web Log Analyzer Application Case Study

Introduction

A web log analyzer is an automated software tool that helps analyze, and make decisions on, a number of issues arising from web application server logs. An ideal web log analyzer would analyze unlimited streams of log data and help keep the entire server environment running smoothly and without fault. It does this by eliminating the need to manually access the logs, automating the flow of information, and alerting the system administrator as needed.


Client-Server Architecture

Every web-based application runs on a client-server architecture. Clients are entities that access servers, and servers are entities that respond to clients with a result. Many clients simultaneously try to access the servers. The servers may be database servers, network servers, application servers, or any server in the n-tier architecture. For each request, a log entry is generated. The rate of access requests determines the rate of the stream of log entries, which leads to a potentially huge log over time. The log can be processed as a stream of data, and it can also be stored on the servers for later analysis.

Logs can be used for monitoring, audit, and analysis purposes. They can help with error diagnostics in case a website becomes slow or goes down. Logs can be analyzed to detect hacking activity. They can also be analyzed to summarize the popularity of web pages and the distribution of the page requesters. They can help with measuring access volumes, and with scaling the infrastructure up or down.


Web Log Analyzer

The log analyzer receives streaming logs from a server location, and analyzes them using many algorithms to generate the desired results. The system is completely automated: the log is produced, and it is consumed to make real-time reports. It is easy to imagine the massive data flow produced by the log in the server environment while it is also being analyzed simultaneously on the administrator side.

Requirements

This is a log analyzer for a web application hosted on a server. It is a busy application owned by a big company, receiving more than 15,000 web access requests per hour. All the access requests need to be logged and dumped to the Hadoop file system periodically. The analyzer is required to ingest real-time log data, and filter out a part of the data for analysis before dumping it to HDFS. It has to do streaming data flow management as well as batch processing. The analyzer needs to process the data before it is dumped into HDFS, and also after it is put into HDFS. The system administrators should be alerted in real time about possible threats, overloads, delays, potential errors, and any other damage. The results of all the analyses need to be stored in a database for later presentation in a graphical format. The results have to be made available for any period of time, without any missing time values. The log data has to be preserved for the future without losing any log entries.

Solution Architecture

Get streaming data using Apache Flume, and send it to HDFS. Use Apache Spark as the data flow management platform and processing engine. Store the results of analysis in MongoDB. This is a safe solution, because the data gets stored in the Hadoop cluster and is available for future requirements, even while it is being analyzed in real time. The results of real-time processing also go into MongoDB.


Figure 10.1: Web Log Analyzer Architecture


Benefits of This Solution

The advantages of this solution are:

1. Real-time logging and analysis: data generated on the server is streamed directly to HDFS by the Flume agent without delay. Every log entry generated at every point in time is analyzed and used for monitoring and decision making.

2. Automatic log handling and storage: loading data into HDFS normally requires manually running certain Hadoop commands. This log analyzer uses a Flume agent or Spark Streaming to handle all data on its own, without any externally managed effort.

3. Easy and convenient implementation using built-in and easy-to-customize machine learning algorithms in Spark.

4. Easy error handling, server request handling, and overall server performance optimization. It makes servers smarter by keeping track of almost every aspect of the server.


Technology Stack

The technology stack used for this application is shown below. A brief description of each component follows.

1. Apache Spark v2
2. Hadoop 2.6.0 cdh5
3. Apache Flume
4. Scala, Java
5. MongoDB
6. RESTful web services
7. Front-end UI tools
8. Linux shell scripts


Apache Spark

Spark is a fast, in-memory cluster computing technology, designed for fast batch and streaming computation. It builds on the Hadoop ecosystem and extends the MapReduce model to more types of computation, including interactive queries and stream processing. It has many libraries and packages, such as machine learning (MLlib) and graph computation (GraphX). It claims to execute 10 to 100 times faster than Hadoop MapReduce because of its in-memory computation model. It also supports multiple languages such as Scala, Python, Java, and R.

Spark Deployment

1. Standalone
2. Hadoop YARN
3. SIMR: Spark in MapReduce
4. Mesos

Components of Spark

Spark SQL: A data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming: Ingests data in mini-batches and performs RDD transformations on those mini-batches, enabling streaming data analytics using RDDs.

MLlib (machine learning): A distributed machine learning framework, which operates in-memory at high speed, and offers many ML algorithms.

GraphX: A distributed graph processing framework that provides an API for many graph computation algorithms.

Spark Core: The general execution engine for the Spark platform upon which all other functionality is built. It takes care of task dispatching and scheduling, and basic I/O functionality.

Spark shell: A powerful tool to analyze data interactively, available in Scala and Python. Spark's primary data abstraction is an in-memory collection of items called an RDD (Resilient Distributed Dataset). RDDs can be created from Hadoop input formats such as files in HDFS, and by transforming existing RDDs with operations such as filters and maps. A short example is shown below.
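Here is a minimal sketch of an interactive spark-shell session in Scala. The HDFS path and the response-code string below are illustrative assumptions only, not part of the case study.

// Inside spark-shell, the SparkContext is already available as sc.
val logs = sc.textFile("hdfs:///user/ubuntu/access-logs/access.log")   // create an RDD from an HDFS file
val unauthorized = logs.filter(line => line.contains(" 401 "))         // transform into a new, filtered RDD
println("401 responses: " + unauthorized.count())                      // an action triggers the computation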

Scripting and programming model using SparkContext: One can use an IDE to develop and test the analytics code, and then create a jar to run the analytics on the Hadoop architecture. The jar can be submitted to the Spark engine using the spark-submit utility. For example:


spark-submit --class apache.accesslogs.ServerLogAnalyzer --master local[*] ScalaSpark/Scala1/target/scala-2.10/Scala1-assembly-1.0.jar > output.txt


HDFS

HDFS is the distributed file system at the core of the Hadoop system. Its key characteristics are:

- Deployed on low-cost commodity hardware

- Fault tolerant

- Supports batch processing

- Designed for large data sets and large files

- Maintains coherence through a write-once, read-many-times model

- Moves computation to the location of the data


MongoDB

MongoDB is a document-oriented database that stores data as flexible, JSON-like documents. It came into existence as a NoSQL database.


Apache Flume

Flume is an open-source tool for handling streaming logs or data. It is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store. It is a popular tool to assist with data flow into HDFS storage. Flume is not restricted to log data: the data sources are customizable, so the source may be event data, traffic data, social media data, or any other data source. The major components of Flume are:

- Event

- Agent

- Data generators

- Centralized stores

A minimal agent configuration sketch is shown below.
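As an illustration only (the agent name, the log file path, and the HDFS directory below are assumptions, not part of this case study), a single Flume agent that tails a web server log and writes the events into HDFS could be configured along these lines:

# Name the source, channel, and sink of one agent (agent name is illustrative)
weblog-agent.sources = tail-source
weblog-agent.channels = mem-channel
weblog-agent.sinks = hdfs-sink

# Data generator: tail the web server access log (path is an assumption)
weblog-agent.sources.tail-source.type = exec
weblog-agent.sources.tail-source.command = tail -F /var/log/apache2/access.log
weblog-agent.sources.tail-source.channels = mem-channel

# Buffer events in memory between the source and the sink
weblog-agent.channels.mem-channel.type = memory

# Centralized store: write the events into an HDFS directory
weblog-agent.sinks.hdfs-sink.type = hdfs
weblog-agent.sinks.hdfs-sink.hdfs.path = hdfs:///user/ubuntu/weblogs/
weblog-agent.sinks.hdfs-sink.channel = mem-channel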


Overall Application Logic

The system reads access logs and presents the results in tabular and graphical form to end users. The system provides the following major functions:

1. Calculate content size
2. Count response codes
3. Analyze requesting IP addresses
4. Manage endpoints


Technical Plan for the Application

Technically, the project follows this structure:

1. Flume takes the streaming log from the running application server and stores it in HDFS. Flume uses compression to store the huge log files, both to speed up data transfer and for storage efficiency.

2. Apache Spark uses HDFS as the input source and analyzes the data using MLlib. Apache Spark stores the analyzed data in MongoDB.

3. A RESTful Java service fetches JSON objects from MongoDB and sends them to the front end, where graphical tools are used to present the data.


Scala Spark Code for Log Analysis

Note: This application is written in the Scala language. Below is the operative part of the code. Visit the GitHub link later in this chapter for the complete Scala code for this application.

import org.apache.spark.rdd.RDD   // RDD type used by the methods below

// Calculates the size of log content, and prints the min, max and average size.
// Caching is done because the size RDD is used repeatedly.
def calcContentSize(log: RDD[AccessLogs]) = {
  val size = log.map(log => log.contentSize).cache()
  val average = size.reduce(_ + _) / size.count()
  println("Content Size :: Average :: " + average +
    " || Maximum :: " + size.max() + " || Minimum :: " + size.min())
}

// Prints every response code with its frequency of occurrence.
def responseCodeCount(log: RDD[AccessLogs]) = {
  val responseCount = log.map(log => (log.responseCode, 1))
    .reduceByKey(_ + _)
    .take(1000)
  println(s"""Response Codes Count: ${responseCount.mkString("[", ",", "]")}""")
}

// Prints the IP addresses that appear more than once in the server log, with their counts.
def ipAddressFilter(log: RDD[AccessLogs]) = {
  val result = log.map(log => (log.ipAddr, 1))
    .reduceByKey(_ + _)
    .filter(count => count._2 > 1)
    // .map(_._1).take(10)   // alternative: keep only the first 10 addresses
    .collect()
  println(s"""IP Addresses Count :: ${result.mkString("[", ",", "]")}""")
}


Sample Log Data

Input fields (selected fields):

Certain fields have been omitted to keep the code clear. The response code field is the basis of the major reports.

1. ipAddress: String
2. dateTime: String
3. method: String
4. endPoint: String
5. protocol: String
6. responseCode: Long
7. contentSize: Long

Sample input rows of data:

64.242.88.10 [07/Mar/2014:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

64.242.88.10 [07/Mar/2014:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523

64.242.88.10 [07/Mar/2014:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291

64.242.88.10 [07/Mar/2014:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352

64.242.88.10 [07/Mar/2014:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253

64.242.88.10 [07/Mar/2014:16:23:12 -0800] "GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore&param1=1.12&param2=1.12 HTTP/1.1" 200 11382


Sample Output of Web Log Analysis

Content Size :: Average :: 10101 || Maximum :: 138789 || Minimum :: 0

Response Codes Count: [(401,113), (200,591), (302,1)]

IP Addresses Count :: [(127.0.0.1,31), (207.195.59.160,15), (67.131.107.5,3), (203.147.138.233,13), (64.242.88.10,452), (10.0.0.153,188)]

EndPoints :: [(/wap/Project/login.php,15), (/cgi-bin/mailgraph.cgi/mailgraph_2.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_0.png,12), (/wap/Project/loginsubmit.php,12), (/cgi-bin/mailgraph.cgi/mailgraph_2_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_1.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_0_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_1_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_3_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_3.png,12)]

Intermediate data is stored in the Hadoop File System in CSV format.

To see detailed code, visit: https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzer.scala

This web log analyzer can be enhanced in many ways. For example, it can analyze the history of logs from previous years and discover web access trends. The application can also be made to move data older than five years into permanent backup storage.


Conclusion and Findings

There are more than 100 technologies in and around the Apache ecosystem. The most basic is the MapReduce technique used by the Hadoop engine, and many stacks are available on top of MapReduce. It is important to incorporate the right set of elements to build the right stack for a particular large-scale data analytics problem. A few powerful technologies like HDFS, Spark, Hive, MongoDB, and Flume/Kafka are likely to make a big data application powerful and worthwhile.

It is also useful to experiment with many other technologies during the development of such a log analyzer. Flume and Kafka are among the most powerful tools to handle streaming data. Spark has its own streaming API, but it is not easy to integrate with HDFS storage. Developing this application also helps one learn Linux-based tasks and shell scripts, along with data handling tools like AWK and the stream editor (sed).

This application reduces the burden of manually handling logs on database, application, or history servers. Moreover, it helps present the analyzed data in an impressive way that leads to easy decision making. This application came into development after doing much research on big data tools such as Apache Spark, which saved a lot of time and cost later. It was developed using agile development practices.


Review Questions

Q1: Describe the advantages of a web log analyzer.

Q2: Describe the major challenges in developing this application.

Q3: Check out the references in this chapter. Identify 3-4 major lessons learned from the code and video.


Chapter 11: Data Mining Primer

Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.

Data mining is a multidisciplinary field that borrows techniques from a variety of fields. It utilizes knowledge of data quality and data organization from the databases area. It draws modeling and analytical techniques from statistics and computer science (artificial intelligence). It also draws knowledge of decision-making from the field of business management.

The field of data mining emerged in the context of pattern recognition in defense, such as identifying friend-or-foe on a battlefield. Like many other defense-inspired technologies, it has evolved to help gain competitive advantage in business.

For example, "customers who buy cheese and milk also buy bread 90 percent of the time" would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, "people with blood pressure greater than 160 and an age greater than 65 are at high risk of dying from a heart stroke" is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity.

Past data can be of predictive value in many complex situations, especially where the pattern may not be easily visible without a modeling technique. Here is a dramatic case of a data-driven decision-making system that beat the best human experts. Using past data, a decision tree model was developed to predict the votes of Justice Sandra Day O'Connor, who held the swing vote on a 5-4 divided US Supreme Court. All her previous decisions were coded on a few variables. What emerged from data mining was a simple four-step decision tree that was able to accurately predict her votes 71 percent of the time. In contrast, the legal analysts could at best predict correctly 59 percent of the time (Source: Martin et al. 2004).

Page 201: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Gathering and Selecting Data

To learn from data, quality data needs to be effectively gathered, cleaned and organized, and then efficiently mined. One requires the skills and technologies for consolidation and integration of data elements from many sources.

Gathering and curating data takes time and effort, particularly when it is unstructured or semi-structured. Unstructured data can come in many forms, such as documents, blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets, and streams of machine-generated data from connected machines, RFID tags, the Internet of Things, and so on. Eventually the data should be rectangularized, that is, put into rectangular data shapes with clear columns and rows, before submitting it to data mining.

Knowledge of the business domain helps select the right streams of data for pursuing new insights. Only the data that suits the nature of the problem being solved should be gathered. The data elements should be relevant, and suitably address the problem being solved. They could directly impact the problem, or they could be a suitable proxy for the effect being measured. Selected data could also be gathered from the data warehouse. Every industry and business function has its own requirements and constraints: the healthcare industry will provide a different type of data with different data names than, say, the HR function, and there will be different issues of quality and privacy for these data.

Page 202: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Data Cleansing and Preparation

The quality of data is critical to the success and value of a data mining project. Otherwise the situation becomes one of garbage in, garbage out (GIGO). The quality of incoming data varies by the source and nature of the data. Data from internal operations is likely to be of higher quality, as it will be more accurate and consistent. Data from social media and other public sources is less under the control of the business, and is less likely to be reliable.

Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed: filling in missing values, reining in the effects of outliers, transforming fields, binning continuous variables, and much more, before it is ready for analysis. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up 60-80% of the time needed for a data mining project.

Page 203: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Outputs of Data Mining

Data mining techniques can serve different types of objectives, and the outputs of data mining will reflect the objective being served. There are many ways of representing the outputs of data mining.

One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps one visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch. A related format is a set of business rules, which are if-then statements that show causality. A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate modes of representing the output.

The output can be in the form of a regression equation, or a mathematical function that represents the best-fitting curve for the data. This equation may include linear and nonlinear terms. Regression equations are a good way of representing the output of classification exercises, and are also a good representation of forecasting formulae.

A population "centroid" is a statistical measure for describing the central tendencies of a collection of data points. These might be defined in a multidimensional space. For example, a centroid could be "middle-aged, highly educated, high-net-worth professionals, married with two children, living in coastal areas", or a population of "20-something, ivy-league-educated tech entrepreneurs based in Silicon Valley", or a collection of "vehicles more than 20 years old, giving low mileage per gallon, which failed environmental inspection". These are typical representations of the output of a cluster analysis exercise.

Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example: those who buy milk and bread will also buy butter (with 80 percent probability).

Page 204: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Evaluating Data Mining Results

There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model is created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. Each of these techniques can be implemented with many algorithms. A common metric for all classification techniques is predictive accuracy:

Predictive Accuracy = (Correct Predictions) / (Total Predictions)

Suppose a data mining project has been initiated to develop a predictive model for cancer patients using a decision tree. Using a relevant set of variables and data instances, a decision tree model is created, and the model is then used to predict other data instances. When a true-positive data point is classified as positive, that is a correct prediction, called a true positive (TP). Similarly, when a true-negative data point is classified as negative, that is a true negative (TN). On the other hand, when a true-positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a true-negative data point is classified as positive, that is called a false positive (FP). This is represented using the confusion matrix (Figure 11.1).

Confusion Matrix                         True Class
                                Positive                Negative
Predicted Class    Positive     True Positive (TP)      False Positive (FP)
                   Negative     False Negative (FN)     True Negative (TN)

Figure 11.1: Confusion Matrix

Thus the predictive accuracy can be specified by the following formula:

Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)
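For example, assuming a hypothetical test set of 100 patients for which the model produces TP = 40, TN = 45, FP = 10, and FN = 5, the predictive accuracy would be (40 + 45) / (40 + 45 + 10 + 5) = 85/100 = 85%.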

Page 205: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

All classification techniques have a predictive accuracy associated with a predictive model. The highest possible value is 100%. In practice, predictive models with more than 70% accuracy can be considered usable in business domains, depending upon the nature of the business.

There are no equally objective measures to judge the accuracy of unsupervised learning techniques such as cluster analysis. There is no single right answer for the results of these techniques. For example, the value of a segmentation model depends upon the value the decision-maker sees in those results.

Page 206: Big Data Essentials - pdf.ebook777.compdf.ebook777.com/032/B01HPFZRBY.pdf · Understanding the essentials of Big Data requires ... as Hadoop, MapReduce, Spark ... Kafka Producer example

Data Mining Techniques

Data may be mined to help make more efficient decisions in the future, or it may be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 11.2).

Data Mining Techniques

Supervised Learning (predictive ability based on past data):
- Classification using machine learning: decision trees, neural networks
- Classification using statistics: regression

Unsupervised Learning (exploratory analysis to discover patterns):
- Cluster analysis
- Association rules

Figure 11.2: Important Data Mining Techniques

The most important class of problems solved using data mining is classification problems. Classification techniques are called supervised learning, as there is a way to supervise whether the model is providing right or wrong answers. These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision-making process in the future. The data of past decisions is organized and mined for decision rules or equations, which are then codified to produce more accurate decisions.

Decision trees are the most popular data mining technique, for many reasons:

1. Decision trees are easy to understand and easy to use, by analysts as well as executives, and they show high predictive accuracy.

2. Decision trees automatically select the most relevant variables out of all the available variables for decision making.

3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.

4. Even non-linear relationships can be handled well by decision trees.

There are many algorithms to implement decision trees. Some of the popular ones are C5, CART, and CHAID.

Regression is the most popular statistical data mining technique. The goal of regression is to derive a smooth, well-defined curve that best fits the data. Regression analysis techniques, for example, can be used to model and predict energy consumption as a function of daily temperature. Simply plotting the data may show a non-linear curve. Applying a non-linear regression equation will fit the data well, with high accuracy. Once such a regression model has been developed, the energy consumption on any future day can be predicted using the equation. The accuracy of a regression model depends entirely upon the dataset used, and not on the algorithm or tools used.
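As an illustration only (the quadratic functional form and the coefficient names are assumptions, not taken from any dataset in this book), such a non-linear regression model might take the form: EnergyConsumption = b0 + b1 x Temperature + b2 x Temperature^2, where the coefficients b0, b1, and b2 are estimated from the historical data.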

Artificial Neural Networks (ANN) are a sophisticated data mining technique from the artificial intelligence stream of computer science. They mimic the behavior of the human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron, with the result communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. A neural network can be trained by making a decision over and over again with many data points. It continues to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. The intermediate values passed within the layers of neurons may not make any intuitive sense to an observer. Thus, neural networks are considered a black-box system.

Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for the automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. There can be any number of clusters produced from the data. The K-means technique is a popular clustering technique, and allows the user to guide the analysis by selecting the desired number (K) of clusters in the data. Clustering is also known as the segmentation technique, and it helps divide and conquer large data sets. The technique shows the clusters of things from past data. The output is the centroids for each cluster, and the allocation of data points to their cluster. The centroid definition is then used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques. A small sketch of running K-means is shown after this paragraph.
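As a minimal sketch only (the choice of Spark MLlib from the earlier case study's technology stack, and the made-up two-dimensional customer data, are assumptions, not part of this chapter), K-means clustering can be run in Scala as follows:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical customer points: (age, annual spend in $000s)
val points = sc.parallelize(Seq(
  Vectors.dense(25.0, 20.0), Vectors.dense(27.0, 22.0),
  Vectors.dense(45.0, 80.0), Vectors.dense(47.0, 85.0)))

// Ask K-means for K = 2 clusters, with up to 20 iterations
val model = KMeans.train(points, 2, 20)

// The cluster centers are the centroids describing each discovered segment
model.clusterCenters.foreach(println)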

Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps answer questions about cross-selling opportunities. This is the heart of the personalization engines used by e-commerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers; there are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beer.
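To make the confidence idea concrete with purely hypothetical numbers: a rule {diapers} → {beer} with 2% support and 65% confidence would mean that 2% of all baskets contain both items, and that 65% of the baskets that contain diapers also contain beer.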


Mining Big Data

As data grows larger and larger, there are a few ways in which analyzing Big Data is different.

From Causation to Correlation

There is more data available than there are theories and tools to explain it. Historically, theories of human behavior, and theories of the universe in general, have been intuited and tested using limited and sampled data, with some statistical confidence level. Now that data is available in extremely large quantities about many people and many factors, there may be too much noise in the data to articulate and test clean theories. In that case, it may suffice to treat co-occurrences or correlations of events as significant, without necessarily establishing strong causation.

From Sampling to the Whole

Pooling all the data together into a single big data system can help discover events that bring about a fuller picture of the situation, and highlight threats or opportunities that an organization faces. Working from the full data set can enable discovering remote but extremely valuable insights. For example, an analysis of the purchasing habits of millions of customers and their billions of transactions at thousands of stores can give an organization a vast, detailed, and dynamic view of sales patterns in the company, which may not be available from the analysis of small samples of data by each store or region.

From Dataset to Data Stream

A flowing stream has a perishable and unlimited connotation to it, while a data set has a finitude and permanence about it. With any given infrastructure, one can only consume so much data at a time. Data streams are many, large, and fast. Thus one has to choose which of the many streams of data to engage with; it is equivalent to deciding which stream to fish in. The metrics used for the analysis of streams tend to be relatively simple and relate to the time dimension. Most of the metrics are statistical measures such as counts and means. For example, a company might want to monitor customer sentiment about its products. It could create a social media listening platform that reads all tweets and blog posts about the company in real time. This platform would (a) keep a count of positive and negative sentiment messages every minute, and (b) flag any messages that merit attention, such as sending an online advertisement or purchase offer to that customer.


Data Mining Best Practices

Effective and successful use of data mining requires both business and technology skills. The business aspects help one understand the domain and the key questions; they also help one imagine possible relationships in the data, and create hypotheses to test. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform.

An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The data mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 11.3):

1. Business understanding: The first and most important step in data mining is asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise. In other words, selecting a data mining project is like any other project, in that it should show strong payoffs if the project is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy. A related important element is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of the proposed model as well as the data sets available and required.


Figure 11.3: CRISP-DM Data Mining cycle

2. Data understanding: A related important step is to understand the data available for mining. One needs to be imaginative in scouring for many elements of data from many sources that could help address the hypotheses. Without relevant data, the hypotheses cannot be tested.

3. Data preparation: The data should be relevant, clean, and of high quality. It is important to assemble a team that has a mix of technical and business skills, who understand the domain and the data. Data cleaning can take 60-70% of the time in a data mining project. It may be desirable to continue to experiment with, and add, new data elements from external sources that could help improve predictive accuracy.

4. Modeling: This is the actual task of running many algorithms using the available data to discover whether the hypotheses are supported. Patience is required to engage continuously with the data until the data yields some good insights. A host of modeling tools and algorithms should be used. A tool can be tried with different options, such as running different decision tree algorithms.

5. Model evaluation: One should not accept what the data says at first. It is better to triangulate the analysis by applying multiple data mining techniques, and conducting many what-if scenarios, to build confidence in the solution. One should evaluate and improve the model's predictive accuracy with more test data. When the accuracy has reached a satisfactory level, the model should be deployed.

6. Dissemination and rollout: It is important that the data mining solution is presented to the key stakeholders, and is deployed in the organization. Otherwise the project will be a waste of time, and a setback for establishing and supporting a data-driven decision-making culture in the organization. The model should eventually be embedded in the organization's business processes.


Conclusion

Data mining is like diving into the rough material to discover a valuable finished nugget. While the technique is important, domain knowledge is also important for proposing imaginative solutions that can then be tested with data mining. The business objective should be well understood, and should always be kept in mind, to ensure that the results are beneficial to the sponsor of the exercise.


Review Questions

1. What is data mining? What are supervised and unsupervised learning techniques?
2. Describe the key steps in the data mining process. Why is it important to follow these processes?
3. What is a confusion matrix?
4. Why is data preparation so important and time consuming?
5. What are some of the most popular data mining techniques?
6. How is mining Big Data different from traditional data mining?


Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)


Creating Cluster Servers on AWS and Installing Hadoop from Cloudera

The objective of this tutorial is to set up a big data processing infrastructure using cloud computing, and the Hadoop and Spark software.


Step 1: Creating Amazon EC2 Servers

1. Open https://aws.amazon.com/
2. Click on Services.
3. Click on EC2.

You can see the result below once you click on EC2. If you already have a server, you can see the number of running servers, their volumes, and other information.

4. Click on the Launch Instance button.


5. Click on AWS Marketplace.
6. Type Ubuntu in the search text box.
7. Click on the Select button.

8. Ubuntu is free, so you don't have to worry about the service price. Click on the Continue button.


9. Choose General purpose m1.large and click on Next: Configure Instance Details. (Do not choose the micro instance t1.micro; it is free, but it will not be able to handle the installation.)

10. Click on Next: Add Storage.

11. Specify a volume size of 20 GB (the default of 8 GB will not be sufficient) and click on Next: Tag Instance.

12. Type the name cs488-master (this label helps tell which server is the master and which are the slaves) and click on Next: Security Group.


13. We need to open our server to the world, including most of the ports, because Cloudera needs to use additional ports. Specify the security group name. Type: choose Custom TCP Rule, Port Range: 0-65500, Source: Anywhere. Then click on Review Instance.

14. A warning message appears because we are opening our server to the world; ignore it for now. Click on the Launch button.

15. Type a key pair name and click on the Download Key Pair button (remember the location of the downloaded file; we need this file to log in to the server), and then click on Launch Instances.


16. Now the master server is created.

Now we need four more servers to form the cluster. We do not need to repeat the whole process four times; we simply increase the number of instances and get the 4 servers.

Now we are going to launch 4 more servers, which will be the slaves.

Please repeat steps 4-9: go to the AWS Marketplace, choose Ubuntu, and select the instance type (General purpose).

17. Type 4 in Number of Instances, which will create the 4 additional servers for us.

18. Name the servers cs488-slave.

19. Select the previously created security group.


20. It is important that you choose the existing key pair for these servers too.

If everything goes well, you will have 5 instances, 5 volumes, 1 key pair, and 1 or 2 security groups.

We have now successfully created 5 servers.


Step 2: Connecting to the Servers and Installing the Cloudera Distribution of Hadoop

First of all, take a note of all your server details: IP addresses and DNS addresses, for the master and the slaves. For example:

Master Public DNS Address: ec2-54-200-210-141.us-west-2.compute.amazonaws.com
Master Private IP Address: 172.31.20.82

Slave 1 Private IP: 172.31.26.245
Slave 2 Private IP: 172.31.26.242
Slave 3 Private IP: 172.31.26.243
Slave 4 Private IP: 172.31.26.244

Once you have these recorded, you can connect to the server. If you are using Linux as your operating system, you can use the ssh command from the terminal to connect.

Connecting to the server (Windows)

1. Download the SSH software PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). Also download PuTTYgen to convert the authentication file from .pem to .ppk.


2. Open PuTTYgen and load the authentication file.

Click on Save Private Key.

3. Open PuTTY and type the master public DNS address in the Host Name field, then click on SSH in the left panel > click on Auth >> select the recently converted authentication file (.ppk), and finally click on the Open button.


4. Now you will be able to connect to the server. Please type "ubuntu", the default username, to log in to the system.

5. Once you connect, type the following commands into the terminal:
6. sudo aptitude update
7. cd /usr/local/src/
8. sudo wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
9. sudo chmod u+x cloudera-manager-installer.bin
10. sudo ./cloudera-manager-installer.bin


11. There are 4 more steps where you click on Next and Yes for the license agreement. Once you finish the installation, you need to restart the service:

12. sudo service cloudera-scm-server restart

You are now able to connect to Cloudera Manager from your browser. The address will be http://<YOUR PUBLIC DNS SERVER>:7180, e.g., http://ec2-54-200-210-141.us-west-2.compute.amazonaws.com:7180. The default username and password to log in to the system is admin/admin.


Once the server restarts, it will open the login screen again. The same username and password (admin/admin) is used to log in to the system.

13. Click on Launch the Classic Wizard.

14. Click on Continue.


15. Provide all the private IP addresses of the master and slave computers and click on the Search button.

16. Click on the Continue button.

17. Choose None for SOLR and None for IMPALA, and click on the Continue button.


18. Click on Another User >> type "ubuntu" and select "All hosts accept same private key" >> upload the authentication file (.pem) and click on the Continue button.

19. Now Cloudera will install the software on each of our servers.

20. Once the installation is complete, click on the Continue button.

21. Once it reaches 100%, click on the Continue button. Do not disconnect the internet or shut down the machine; if the process does not complete, we need to restart the whole process. Click on the Continue button.


22. Click on Continue.

23. Choose Core Hadoop and click on the Inspect Role Assignments button.


24. Now, for your master IP, only the NameNode role should be selected, with DataNode unchecked. This is what designates the master and slave servers.

25. Now Cloudera will install all the services. For your future use, you can record the username and password of each service. Click on Test Connection.


26. Click on Continue.


27. Now all the installation is complete. You now have 1 master (name) node and 4 data nodes.


28. You should see the dashboard.


Step 3: WordCount using MapReduce

29. Now log in to the master server from PuTTY.
30. Run the following commands:
31. cd ~/
32. mkdir code-and-data
33. cd code-and-data
34. sudo wget https://s3.amazonaws.com/learn-hadoop/hadoop-infiniteskills-richmorrow-class.tgz
35. sudo tar -xvzf hadoop-infiniteskills-richmorrow-class.tgz
36. cd data
37. sudo -u hdfs hadoop fs -mkdir /user/ubuntu
38. sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu
39. hadoop fs -put shakespeare shakespeare-hdfs
40. hadoop version
41. hadoop fs -ls shakespeare-hdfs

42. sudo hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar wordcount shakespeare-hdfs wordcount-output

43. hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar sleep -m 10 -r 10 -mt 20000 -rt 20000


Appendix 2: Spark Installation and Tutorial

This tutorial will help install Spark and get it running on a standalone machine. It will then help develop a simple analytical application using the R language.


Step 1: Verifying Java Installation

Java is one of the mandatory requirements for installing Spark. Try the following command to verify the Java version.

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)

Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, install Java before proceeding to the next step.


Step 2: Verifying Scala Installation

Verify the Scala installation using the following command.

$ scala -version

If Scala is already installed on your system, you get to see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

In case you don't have Scala installed on your system, proceed to the next step for the Scala installation.


Step 3: Downloading Scala

Download the latest version of Scala from the Scala website (scala-lang.org). For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.


Step 4: Installing Scala

Follow the steps given below for installing Scala.

Extract the Scala tar file

Type the following command for extracting the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move the Scala software files

Use the following commands to move the Scala software files to their respective directory (/usr/local/scala).

$ su -

Password:

# cd /home/Hadoop/Downloads/

# mv scala-2.11.6 /usr/local/scala

# exit

Set PATH for Scala

Use the following command for setting the PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation

After installation, it is better to verify it. Use the following command for verifying the Scala installation.

$ scala -version

If Scala is installed correctly, you will see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL


Step 5: Downloading Spark

Download the latest version of Spark. For this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the download folder.


Step 6: Installing Spark

Follow the steps given below for installing Spark.

Extract the Spark tar file

Use the following command for extracting the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Move the Spark software files

Use the following commands to move the Spark software files to their respective directory (/usr/local/spark).

$ su -

Password:

# cd /home/Hadoop/Downloads/

# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark

# exit

Setting up the environment for Spark

Add the following line to the ~/.bashrc file. This adds the location of the Spark software files to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command for sourcing the ~/.bashrc file.

$ source ~/.bashrc


Step 7: Verifying the Spark Installation

Write the following command for opening the Spark shell.

$ spark-shell

If Spark is installed successfully, then you will see output like the following.

Spark assembly has been built with Hive, including Datanucleus jars on classpath

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop

15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop

15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;

ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)

15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server

15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.

Welcome to Spark version 1.4.0

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)

Type in expressions to have them evaluated.

Spark context available as sc

scala>
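
Once the scala> prompt appears, a quick smoke test confirms that the Spark context (sc) is usable. This check is not part of the original installation steps; any small computation will do. Summing the integers 1 to 100 should print something like the result shown:

scala> sc.parallelize(1 to 100).sum()

res0: Double = 5050.0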

Here you can see the video:

How to install Spark

You might encounter a "file specified not found" error when you are first installing standalone Spark on Windows:


To fix this, you have to set up your JAVA_HOME environment variable.

Step 1: Start -> Run -> command prompt (cmd)

Step 2: Determine where your JDK is located; by default it is in C:\Program Files.

Step 3: Select the JDK to use. In my case, I will use JDK 8.

Copy the JDK directory to your clipboard, go to your command prompt, and set JAVA_HOME to that directory (for example, set JAVA_HOME=C:\path\to\your\jdk). Then press Enter.


Step 4: Add the JDK's bin folder to the general PATH (for example, set PATH=%JAVA_HOME%\bin;%PATH%), and press Enter.

Now go to your Spark folder and run bin\spark-shell.

You have installed Spark. Let's try to use it.


Step 8: Application: WordCount in Scala

Now we will do an example of word count in Scala:

val textFile = sc.textFile("hdfs://…")

val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://…")

NOTE: If you are working on standalone Spark:

The counts.saveAsTextFile("hdfs://…") command will give you a NullPointerException error.

Solution: counts.coalesce(1).saveAsTextFile()
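
Before saving the output, it can also be handy to inspect the most frequent words directly in the Spark shell. Here is a minimal sketch, assuming the counts RDD from the word count example above is in scope:

// Twenty most frequent words, highest count first
counts.sortBy(_._2, ascending = false).take(20).foreach { case (word, n) => println(word + "\t" + n) }

// Number of distinct words
counts.count()

sortBy orders the (word, count) pairs by their count, take(20) pulls the top twenty back to the driver, and count() reports how many distinct words were found.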

For implementing a word cloud, we could use R in our Spark console:

However, if you click on SparkR straight away, you will get an error.

To fix this:

Step 1: Set up the environment variables.

In the PATH variable add your path. I added: ;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\sbin;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin

Step 2: Install the R software and RStudio. Then add the path of the R software to the PATH variable.

I added this to my existing path: ;C:\Program Files\R\R-3.2.2\bin\x64\ (Remember, each path that you add must be separated by a semicolon, with no spaces.)

Step 3: Run the command prompt as an administrator.

Step 4: Now execute the command "SparkR" from the command prompt. If successful, you should see the message "Spark context is available…" as seen below. If your path is not set correctly, you can alternatively navigate to the location where you have downloaded SparkR, in my case (C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin), and execute the "SparkR" command.

Step 5: Configuration inside RStudio to connect to Spark.

Execute the below three commands in RStudio every time:

# Here we are setting up the SPARK_HOME environment variable

Sys.setenv(SPARK_HOME = "C:/spark-1.5.1-bin-hadoop2.6/spark-1.5.1-bin-hadoop2.6")

# Set the library path

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# Loading the SparkR library

library(SparkR)

If you see the below message, then you are all set to start working with SparkR.

Now let's start coding in R:


library(tm)

lords <- Corpus(DirSource("temp/"))

To see what's in that corpus, type the command

inspect(lords)

This should print out the contents on the main screen. Next, we need to clean it up. Execute the following in the command line, one line at a time:

lords <- tm_map(lords, stripWhitespace)

lords <- tm_map(lords, tolower)

lords <- tm_map(lords, removeWords, stopwords("english"))

lords <- tm_map(lords, stemDocument)

The tm_map function comes with the tm package. The various commands are self-explanatory: strip unnecessary whitespace, convert everything to lowercase (otherwise the word cloud might highlight capitalised words separately), remove English common words like 'the' (so-called 'stopwords'), and carry out text stemming for the final tidy-up. Depending on what you want to achieve, you could also explicitly remove numbers and punctuation with the removeNumbers and removePunctuation arguments.

It is possible that you may get error messages whilst executing some of the commands, e.g. missing packages. If so, install the missing packages with install.packages() (for this example the tm, wordcloud, and RColorBrewer packages are needed), load them with library(), and repeat.

If all is well, then you should now be ready to create your first word cloud! Load the wordcloud and RColorBrewer packages with library(wordcloud) and library(RColorBrewer), and try this:

wordcloud(lords, scale = c(5, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))


Additional Resources

Here are some other books, papers, videos, and other resources, for a deeper dive into the topics covered in this book.

1. Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.

2. McKinsey Global Institute Report (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey.com

3. Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail but Some Don't. Penguin Press.

4. Zaharia, Matei, et al. (2010). "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." University of California, Berkeley.

5. Ryza, Sandy; Laserson, Uri; et al. (2014). Advanced Analytics with Spark. O'Reilly.

Websites:

6. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
7. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
8. Hadoop API site: http://hadoop.apache.org/docs/current/api/
9. Apache Spark: http://spark.apache.org/docs/latest/
10. https://www.biostat.wisc.edu/~kbroman/Rintro/Rwinpack.html
11. http://robjhyndman.com/hyndsight/building-r-packages-for-windows/
12. https://stevemosher.wordpress.com/ten-steps-to-building-an-r-package-under-windows/
13. http://www.inside-r.org/packages/cran/wordcloud/docs/wordcloud
14. https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
15. https://intellipaat.com/tutorial/spark-tutorial/
16. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50
17. https://en.wikipedia.org/wiki/NoSQL


18. http://www.planetcassandra.org/what-is-apache-cassandra/
19. http://www.datastax.com/nosql
20. https://www.sitepen.com/blog/2010/05/11/nosql-architecture/
21. http://nosql-database.org/
22. http://webpages.uncc.edu/xwu/5160/nosqldbs.pdf

Video Resources

23. Doug Cutting on 'Hadoop at 10': https://www.youtube.com/watch?v=yDZRDDu3CJo
24. Status of Apache community: https://www.youtube.com/watch?v=sOZnf8Nn3Fo
25. Spark 2.0 updates (showing a nice demo across R, Scala and SQL, using tweets and clustering): https://www.youtube.com/watch?v=9xSz0ppBtFg
26. https://www.youtube.com/watch?v=VwiGHUKAHWM
27. https://www.youtube.com/watch?v=L5QWO8QBG5c
28. https://www.youtube.com/watch?v=KvQto_b3sqw
29. https://www.youtube.com/watch?v=YW28qItH_tA


About the Author

Dr. Anil Maheshwari is a Professor of Computer Science and Information Systems, and the Director of the Center for Data Analytics, at Maharishi University of Management. He teaches courses in data analytics, and helps organizations extract deep insights from their data. He worked in a variety of leadership roles at IBM in Austin, TX, and has also worked at many other companies including startups.

He has taught at the University of Cincinnati, City University of New York, University of Illinois, and others. He earned an Electrical Engineering degree from the Indian Institute of Technology in Delhi, an MBA from the Indian Institute of Management in Ahmedabad, and a Ph.D. from Case Western Reserve University. He is a practitioner of the Transcendental Meditation technique.

He is the author of the #1 bestseller Data Analytics Made Accessible.

He blogs interesting stuff on IT and Enlightenment at anilmah.com.

Instructors can reach him for course materials at akm2030@gmail.com. Speaking engagements are welcome.