This report studies various aspects of data analytics and their application to practical situations. The major components include MySQL, web development, big data, prediction analytics and data integration. The partnership of Cymetrix Software with Looker and Salesforce is also explored. The report also includes a Letter of Appreciation.

PRACTICAL APPLICATIONS OF DATA ANALYTICS THROUGH BIG DATA, PREDICTION ANALYTICS AND DATA INTEGRATION

CYMETRIX SOFTWARE

MEHER KOLHE


Post on 24-Sep-2020


Page 1: This report studies various aspects of Data Analytics and ...meherkolhe.com/documents/cymetrix.pdf · • Hadoop MapReduce: A programming model for large scale data processing. Hadoop



Table of contents:
Introduction
Introduction to Data Analytics
Data Mining in Big Data
Hadoop
Hadoop stack
How Hadoop works
Hive
Spark and Shark stack
MongoDB
MySQL
Google Charts
Combined working
Data Loading
ETL Process


Introduction:
Cymetrix Software is a consulting services company based out of Mumbai. They work mainly in two consulting areas: CRM and data analytics. For CRM, they have a partnership with Salesforce. They also have a partnership with Looker for data visualization. In their data analytics division, they work on big data, prediction analytics, data integration, etc.

I chose to intern at Cymetrix because my project on noise-level measurement in the Powai area in North Mumbai required me to perform data analytics at an advanced level. To do this, I needed professional guidance and hence approached Cymetrix.

During my internship with Cymetrix Software, I studied various aspects of data analytics and their application to practical situations. Working as a Data Analysis intern at the company allowed me to use my SQL, web development and statistical skills to complete the tasks assigned to me. In doing this, I could apply this new skill set to my personal project and develop a data abstraction.


Introduction to Data Analytics:
Data analytics focuses on processing and performing statistical analysis on existing datasets. Analysts concentrate on creating methods to capture, process, and organize data to uncover actionable insights for current problems, and on establishing the best way to present this data. More simply, the field of data analytics is directed toward solving problems for questions we know we don't know the answers to. More importantly, it is based on producing results that can lead to immediate improvements.

Data analytics also encompasses a few different branches of broader statistics and analysis which help combine diverse sources of data and locate connections while simplifying the results.

Data Mining in Big Data:
This refers to techniques, tools, and research designed for automatically extracting meaning from large repositories of data generated by devices or the Internet of Things (IoT). Quite often, this data is extensive, fine-grained, and precise.

A data analytics workbench is required to set up a workflow that takes the different types of data generated at the source, pipelines the data, and processes each type appropriately to ensure that it is ready to be transformed and loaded. It is also needed for analyzing the large amounts of data generated and thus deriving useful results from it.

Data visualization is a modern branch of descriptive statistics. It involves the creation and study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes or variables for the units of information.


Hadoop:
Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. The Apache Hadoop framework is composed of the following modules:

• Hadoop Common: Contains libraries and utilities needed by other Hadoop modules.

• Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

• Hadoop YARN: A resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.

• Hadoop MapReduce: A programming model for large-scale data processing.

Hadoop Stack:

The Hadoop technology stack is made of the following components:

1. HDFS: The Hadoop Distributed File System is the custom file system made for the Hadoop ecosystem. It supports large block sizes and coordinates storage between multiple DataNodes.

2. MapReduce: A programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

3. Pig: A scripting language that lets people using Hadoop focus more on analyzing large data sets and spend less time writing mapper and reducer programs.

4. Hive: Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements. Note, however, that HQL is limited in the commands it understands.
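The mapper and reducer programs mentioned above can be illustrated with a small word count. This is only a single-machine sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are assumptions made for illustration, and on a real cluster the framework would run the map and reduce steps on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    # On a cluster, each line (or block of lines) would be mapped on a different node.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input.
counts = word_count(["big data on Hadoop", "Hadoop stores big files"])
```

Because each mapper call looks at one line only and each reducer call looks at one key only, the work can be spread over many servers, which is where the scalability of the paradigm comes from.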


How Hadoop works:

Hadoop and its various components fit together to ensure a fault-tolerant, durable and highly efficient model for storage and management of big data. These components are:

• Namenode: The Namenode is the node which stores the file system metadata, i.e. which file maps to which block locations and which blocks are stored on which datanode. The Namenode maintains two in-memory tables: one which maps blocks to datanodes (one block maps to 3 datanodes for a replication value of 3) and one which maps each datanode to its block numbers. Whenever a datanode reports a disk corruption of a block, the first table gets updated, and whenever a datanode is detected to be dead (because of a node or network failure), both tables get updated.

• Secondary Namenode: The secondary Namenode regularly connects to the primary Namenode and keeps snapshotting the system metadata into local or remote storage. It does so at a low frequency and should not be heavily relied on.


• Datanode: The datanode is where the actual data resides.

Points to note:

1. All datanodes send a heartbeat message to the Namenode every 3 seconds to say that they are alive.

2. If the Namenode does not receive a heartbeat from a datanode for 10 minutes, it considers that datanode to be dead or out of service and initiates replication of the blocks which were hosted on that datanode onto another datanode.

3. The datanodes can talk to each other to rebalance data, move and copy data around and keep the replication high.

4. When a datanode stores a block of information, it maintains a checksum for it as well.

5. The datanodes update the Namenode with their block information periodically, and verify the checksums before updating.

6. If the checksum is incorrect for a block, i.e. there is a disk-level corruption for that block, the datanode skips that block while reporting the block information to the Namenode.

7. In this way, the Namenode is aware of the disk-level corruption on that datanode and takes steps accordingly.

• NodeManager: This is a YARN daemon which runs on individual nodes and receives updated information on resource containers from their individual datanodes via background daemons. Different resources such as memory, CPU time and network bandwidth are put into one unit called the resource container. The NodeManager in turn ensures fault tolerance on the datanodes for any MapReduce jobs.

• ResourceManager: This is a YARN daemon which manages the allocation of resources to the different jobs. It also comprises a scheduler which takes care only of scheduling jobs, without worrying about any monitoring or status updates.
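The Namenode bookkeeping described above (two tables, heartbeats, and block reports that skip corrupt blocks) can be sketched as a toy model. This is not HDFS code: the class NameNode, the helper verified_blocks, the use of CRC32 as the checksum, and the node and block names are all assumptions chosen for illustration. Only the ten-minute timeout and the replication value of 3 come from the text.

```python
import zlib

HEARTBEAT_TIMEOUT = 10 * 60   # seconds of silence before a datanode is declared dead
REPLICATION = 3               # desired number of copies of each block


class NameNode:
    def __init__(self):
        self.block_to_nodes = {}   # table 1: block id -> set of datanode names
        self.node_to_blocks = {}   # table 2: datanode name -> set of block ids
        self.last_heartbeat = {}   # datanode name -> time of the last heartbeat

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now
        self.node_to_blocks.setdefault(node, set())

    def block_report(self, node, reported):
        """Replace what is known about `node` with the blocks it just reported."""
        for block in self.node_to_blocks.get(node, set()) - set(reported):
            # A block missing from the report is treated as corrupt on this node.
            self.block_to_nodes[block].discard(node)
        self.node_to_blocks[node] = set(reported)
        for block in reported:
            self.block_to_nodes.setdefault(block, set()).add(node)

    def remove_dead_nodes(self, now):
        """Drop nodes whose heartbeat is overdue and update both tables."""
        dead = [n for n, t in self.last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]
        for node in dead:
            for block in self.node_to_blocks.pop(node, set()):
                self.block_to_nodes[block].discard(node)
            del self.last_heartbeat[node]
        return dead

    def under_replicated(self):
        """Blocks that now have fewer copies than the replication value."""
        return sorted(b for b, nodes in self.block_to_nodes.items() if len(nodes) < REPLICATION)


def verified_blocks(stored):
    """Datanode side: report only blocks whose stored checksum still matches the data."""
    return [bid for bid, (data, checksum) in stored.items() if zlib.crc32(data) == checksum]


nn = NameNode()
for name in ("dn1", "dn2", "dn3"):
    nn.heartbeat(name, now=0)
    nn.block_report(name, ["blk_1"])

# dn3's copy is corrupted on disk, so its next report skips blk_1.
stored_on_dn3 = {"blk_1": (b"corrupted bytes", zlib.crc32(b"original bytes"))}
nn.block_report("dn3", verified_blocks(stored_on_dn3))
after_corruption = nn.under_replicated()

# dn1 and dn3 keep sending heartbeats; dn2 stays silent for more than ten minutes.
nn.heartbeat("dn1", now=601)
nn.heartbeat("dn3", now=601)
dead_nodes = nn.remove_dead_nodes(now=601)
```

In this sketch a corrupt block changes only the block-to-datanode table, while a dead datanode changes both tables, matching the behaviour described for the Namenode. Scheduling the actual re-replication is left out.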

Hive:
Hive is used as a SQL-like substitute for querying the Hadoop file system, which makes it easy for people acquainted with SQL to query data from it instead of having to learn MapReduce.


Spark and Shark Stack:

Spark:

• Apache Spark is an open-source cluster computing framework for data analytics. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS).

• Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

• It follows the concept of a Resilient Distributed Dataset (RDD), which allows data to be stored transparently in memory and persisted to disk if needed.

• It provides the machine learning library MLlib for data analytics on the fly.

Shark:

• Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

• It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries.

• Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments.

• Shark supports Hive UDFs, SerDes and every major functionality in Hive, some of which are not supported in similar query engines such as Impala.

• Spark is written in Scala and has its own version of the Scala interpreter, so it is preferable to use Scala.
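The RDD idea mentioned in the Spark list, where data is described by how it is computed and kept in memory once it has been materialized, can be illustrated with a toy class. This is plain Python and not the Spark API: MiniRDD and all of its methods are hypothetical names, the sketch runs on one machine, and persisting to disk is not modelled.

```python
class MiniRDD:
    """Toy stand-in for an RDD: it records how to compute its data and computes on demand."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning a list
        self._cached = None
        self._keep = False

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        # Transformation: nothing runs yet, the new dataset only remembers its parent.
        return MiniRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Ask for the result to be kept in memory after the first action.
        self._keep = True
        return self

    def collect(self):
        # Action: triggers the computation, reusing the cached result if present.
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._keep:
            self._cached = result
        return result


calls = []


def noisy_square(x):
    calls.append(x)   # record every real computation
    return x * x


squares = MiniRDD.from_list([1, 2, 3, 4]).map(noisy_square).cache()
evens = squares.filter(lambda v: v % 2 == 0)

before = len(calls)          # no work has been done while only transformations were chained
first = evens.collect()      # the first action computes the squares and caches them
second = squares.collect()   # served from memory, noisy_square is not called again
```

Keeping intermediate results in memory between steps, instead of writing them out after every stage, is one reason iterative workloads can run much faster than with two-stage MapReduce.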


MongoDB:
MongoDB is an open-source, document-oriented NoSQL database system. In the edX platform, the discussion forum and some course-related material are stored as collections of JSON-like documents in a MongoDB database. We used a local MongoDB database system to store the mongo dump files retrieved from the edX data package.

MySQL:
MySQL is a relational database management system (RDBMS), and ships with no GUI tools to administer MySQL databases or manage the data contained within them. The data from a MySQL dump can be processed by importing it into HDFS using Sqoop.

Sqoop:
Sqoop is a tool designed to transfer data between Hadoop and relational databases. We used Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS). Sqoop automates most of the process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
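The two ideas in the Sqoop description, letting the database describe the schema and splitting the import into parallel pieces, can be sketched with Python's built-in sqlite3 module standing in for MySQL. This is an illustration of the principle rather than of Sqoop itself: the table readings, its columns, and the helper functions are hypothetical, and the slices are read one after another here instead of by parallel map tasks.

```python
import sqlite3


def describe_table(conn, table):
    """Ask the database itself for the column names and types."""
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]


def split_ranges(conn, table, key, num_mappers):
    """Cut the key range [min, max] into one contiguous slice per mapper."""
    lo, hi = conn.execute(f"SELECT MIN({key}), MAX({key}) FROM {table}").fetchone()
    step = -(-(hi - lo + 1) // num_mappers)   # ceiling division
    return [(start, min(start + step - 1, hi)) for start in range(lo, hi + 1, step)]


def import_slice(conn, table, key, bounds):
    """Work done by one mapper: read only the rows that fall in its slice."""
    return conn.execute(
        f"SELECT * FROM {table} WHERE {key} BETWEEN ? AND ? ORDER BY {key}", bounds
    ).fetchall()


# Hypothetical source table with ten rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, decibels REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [(i, 50.0 + i) for i in range(1, 11)])

schema = describe_table(conn, "readings")
ranges = split_ranges(conn, "readings", "id", num_mappers=4)
parts = [import_slice(conn, "readings", "id", r) for r in ranges]
```

Since every slice covers a disjoint key range, the slices can be imported independently, and a failed slice can be retried without touching the others.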


Merits of using Python over R:
Python is a high-level language. The language provides constructs intended to enable clear programs on both a small and large scale. R is a free software programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software.

• The main advantage of Python over R is that it is a general-purpose programming language. It scales easily, so it is conceivable that anything you have in your sandbox can be used in production.

• The JDBC connector of Python is much faster than the RHive connector. Thus, the time taken by Python to return data from Hive is less than that in R.

• Also, Python has Cython, which is used for improving computational speed in Python. Cython allows one to statically type variables, e.g. cdef int i declares i to be an integer. This can give large speedups, as typed variables are treated using low-level types rather than Python objects.

Google Charts:
Google Charts provides a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides several ready-to-use chart types. The most common way to use Google Charts is with simple JavaScript that you embed in your web page. You load some Google Chart libraries, list the data to be charted, select options to customize your chart, and finally create a chart object with an id that you choose. Then, later in the web page, you create a <div> with that id to display the Google Chart.

Charts are exposed as JavaScript classes, and Google Charts provides many chart types for you to use. The default appearance will usually be all you need, and you can always customize a chart to fit the look and feel of your website. Charts are highly interactive and expose events that let you connect them to create complex dashboards or other experiences integrated with your web page. Charts are rendered using HTML5/SVG technology to provide cross-browser compatibility (including VML for older IE versions) and cross-platform portability to iPhones, iPads and Android. Your users will never have to mess with plugins or any software. If they have a web browser, they can see your charts.

All chart types are populated with data using the DataTable class, making it easy to switch between chart types as you experiment to find the ideal appearance. The DataTable provides methods for sorting, modifying, and filtering data, and can be populated directly from your web page, a database, or any data provider supporting the Chart Tools Datasource protocol. (That protocol includes a SQL-like query language and is implemented by Google Spreadsheets, Google Fusion Tables, and third-party data providers such as Salesforce. You can even implement the protocol on your own website and become a data provider for other services.)

Combined working:
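One place where the pieces meet is the hand-off from the Python backend to Google Charts, since every chart is populated from a DataTable. The following is a minimal sketch, assuming a hypothetical helper to_datatable and made-up sample readings, of how server-side Python could build the JSON form of a DataTable (a list of cols and a list of rows whose cells sit under the keys c and v). How the payload reaches the page is not shown in the report and is left open here.

```python
import json

# Map Python types to the column type names used by DataTable.
TYPE_NAMES = {str: "string", int: "number", float: "number", bool: "boolean"}


def to_datatable(columns, rows):
    """Build the object-literal form of a DataTable from column specs and row tuples."""
    cols = [{"id": name, "label": name, "type": TYPE_NAMES[py_type]} for name, py_type in columns]
    body = [{"c": [{"v": value} for value in row]} for row in rows]
    return {"cols": cols, "rows": body}


# Hypothetical sample data.
table = to_datatable(
    columns=[("area", str), ("avg_decibels", float)],
    rows=[("Site A", 71.5), ("Site B", 64.0)],
)
payload = json.dumps(table)   # string that can be sent to the browser
```

On the page, the JavaScript side would pass this JSON to the DataTable constructor and hand the resulting table to whichever chart class is being drawn.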

Data Loading:
On choosing this option, the client gets a form on which to fill in all the details required to upload, update or reset HDFS. On clicking the submit button, the data is sent through a Django server to a function which starts the process. First, it is checked whether some other process is already running.

If not, the current process starts by passing the required information to a Python script which, depending on the inputs, starts by cleaning the data stored in the path given by the user and then loads it onto HDFS. Before cleaning starts, it is also verified whether the given path contains appropriate data for processing. If the data found is not proper, a message with the details is sent from the script to the Django views, shown to the user through the web interface, and the user is asked for correct details.


If a process is already running, the latest instruction cannot be performed immediately, as this could leave HDFS in an inconsistent state. At the same time, the user's instruction must not be lost if it is not executed. To solve this issue, a queuing mechanism based on django-rq is used.

Django-rq is a queuing mechanism based on Redis and RQ. It allows us to create multiple queues (defined in the settings.py file), any of which can be used, as per the programmer's choice, without affecting the other queues, and to put jobs into them easily. Jobs placed in a queue are executed as and when the previous job is completed; until then they are stored, waiting for their turn. Workers read the enqueued requests and run them. These RQ workers need to be active when a request is made, otherwise the processes will not be able to execute.
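The queuing behaviour described above can be illustrated without Redis by using Python's standard queue and threading modules as a stand-in. This is not django-rq code: the names jobs, worker and load_job, and the paths, are hypothetical. The sketch only shows that requests are kept in arrival order and run one at a time once a worker is active.

```python
import queue
import threading

jobs = queue.Queue()   # FIFO queue; with django-rq this would live in Redis
results = []


def worker():
    """Run queued jobs one at a time, in arrival order, until told to stop."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut the worker down
            jobs.task_done()
            break
        func, args = job
        results.append(func(*args))
        jobs.task_done()


def load_job(action, path):
    # Placeholder for the real cleaning-and-loading script.
    return f"{action} {path}"


# Requests arrive while earlier work may still be running; none of them is dropped.
jobs.put((load_job, ("upload", "/data/batch1")))
jobs.put((load_job, ("update", "/data/batch2")))
jobs.put((load_job, ("reset", "/data")))

# Jobs wait in the queue until a worker is started.
waiting = jobs.qsize()
thread = threading.Thread(target=worker)
thread.start()
jobs.put(None)
thread.join()
```

Using a single worker is what guarantees that two loads never touch HDFS at the same time. In django-rq itself, jobs are typically submitted with django_rq.enqueue and processed by a worker started through the rqworker management command.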


ETL Process: