This report studies various aspects of Data Analytics and their application to practical situations. The major components include MySQL, web development, big data, prediction analytics and data integration. The partnership of Cymetrix Software with Looker and Salesforce is also explored.

The report also includes a Letter of Appreciation.

PRACTICAL APPLICATIONS OF DATA ANALYTICS THROUGH BIG DATA, PREDICTION ANALYTICS AND DATA INTEGRATION

CYMETRIX SOFTWARE

MEHER KOLHE

Table of contents:

Introduction
Introduction to Data Analytics
Data Mining in Big Data
Hadoop
Hadoop stack
How Hadoop works
Hive
Spark and Shark stack
MongoDB
MySQL
Google Charts
Combined working
Data Loading
ETL Process

Introduction:

Cymetrix Software is a consulting services company based out of Mumbai. It works mainly in two consulting areas: CRM and Data Analytics. For CRM, it has a partnership with Salesforce. It also has a partnership with Looker for data visualization. Its data analytics division works on Big Data, prediction analytics, data integration and related areas.

I chose to intern at Cymetrix because my project on noise-level measurement in the Powai area of North Mumbai required me to perform data analytics at an advanced level. To do this, I needed professional guidance and hence approached Cymetrix.

During my internship with Cymetrix Software, I studied various aspects of Data Analytics and their application to practical situations. Working as a Data Analysis intern at the company allowed me to use my SQL, web development and statistical skills to achieve the tasks assigned to me. In doing so, I could apply this new skill set to my personal project and develop a data abstraction.

Introduction to Data Analytics:

Data analytics focuses on processing and performing statistical analysis on existing datasets. Analysts concentrate on creating methods to capture, process, and organize data to uncover actionable insights for current problems, and on establishing the best way to present this data. More simply, the field of data and analytics is directed toward solving problems for questions we know we don't know the answers to. More importantly, it is based on producing results that can lead to immediate improvements.

Data analytics also encompasses a few different branches of broader statistics and analysis, which help combine diverse sources of data and locate connections while simplifying the results.

Data Mining in Big Data:

This refers to techniques, tools, and research designed for automatically extracting meaning from large repositories of data generated by devices or the IoT. Quite often, this data is extensive, fine-grained, and precise. A data analytics workbench is required to set up a workflow that takes the different types of generated data as a source, pipelines the data, and processes it according to its type so that it is ready to be transformed and loaded. It is also needed for analyzing the large amounts of data generated and thus deriving useful results from it.

Data visualization is a modern branch of descriptive statistics. It involves the creation and study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes or variables for the units of information.

Hadoop:

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. The Apache Hadoop framework is composed of the following modules:

• Hadoop Common: Contains libraries and utilities needed by other Hadoop modules.

• Hadoop Distributed File System (HDFS): A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

• Hadoop YARN: A resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.

• Hadoop MapReduce: A programming model for large-scale data processing.

Hadoop Stack:

The Hadoop Technology Stack is made of the following components:

1. HDFS: The Hadoop Distributed File System is the customized file system made for the Hadoop ecosystem, which supports large block sizes and coordinates storage between multiple DataNodes.

2. MapReduce: Programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

3. Pig: Scripting language that allows people using Hadoop to focus more on analyzing large data sets and spend less time writing mapper and reducer programs.

4. Hive: Hive allows SQL developers to write Hive Query Language (HQL) statements that are like standard SQL statements. Note that HQL is limited in the commands it understands.
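The MapReduce paradigm listed above can be illustrated with a minimal, single-process sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only to mirror the mapper, shuffle and reducer roles that the framework would distribute across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on Hadoop", "Hadoop stores big data"]
    print(word_count(sample))
```

In a real cluster the mapper and reducer would run on many nodes in parallel; tools such as Pig and Hive exist so that these functions need not be written by hand.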

How Hadoop works:

Hadoop and its various components fit together to ensure a fault-tolerant, durable and highly efficient model for storage and management of Big Data. These components are:

• Namenode: The Namenode is the node which stores the file system metadata, i.e. which file maps to what block locations and which blocks are stored on which datanode. The Namenode maintains two in-memory tables: one which maps the blocks to datanodes (one block maps to 3 datanodes for a replication value of 3) and a datanode to block number mapping. Whenever a datanode reports a disk corruption of a block, the first table gets updated, and whenever a datanode is detected to be dead (because of a node/network failure), both tables get updated.

• Secondary Namenode: The secondary Namenode regularly connects to the primary Namenode and keeps snapshotting the system metadata into local/remote storage. It does so infrequently and should not be heavily relied on.

• Datanode: The datanode is where the actual data resides.

Points to Note:

1. All datanodes send a heartbeat message to the Namenode every 3 seconds to say that they are alive.

2. If the Namenode does not receive a heartbeat from a datanode for 10 minutes, then it considers that datanode to be dead/out of service and initiates replication of the blocks which were hosted on that datanode to be hosted on another datanode.

3. The datanodes can talk to each other to rebalance data, move and copy data around and keep the replication high.

4. When the datanode stores a block of information, it maintains a checksum for it as well.

5. The datanodes update the Namenode with the block information periodically and, before updating, verify the checksums.

6. If the checksum is incorrect for a block, i.e. there is a disk-level corruption for that block, the datanode skips that block while reporting the block information to the Namenode.

7. In this way, the Namenode is aware of the disk-level corruption on that datanode and takes steps accordingly.

• NodeManager: This is a YARN daemon which runs on individual nodes and receives updated information on resource containers from their individual datanodes via background daemons. Different resources such as memory, CPU time, network bandwidth etc. are put into one unit called the Resource Container. The NodeManager in turn ensures fault tolerance on the datanodes for any MapReduce jobs.

• ResourceManager: This is a YARN daemon which manages the allocation of resources to the different jobs. It also comprises a scheduler, which just takes care of scheduling jobs without worrying about any monitoring or status updates.
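The heartbeat and checksum behaviour described in the points to note can be sketched as a small in-memory simulation. This is an illustrative model only, not HDFS code: the `Datanode` and `Namenode` classes, their method names, and the use of `zlib.crc32` as the checksum are assumptions made for the example. Only the 10-minute timeout comes from the text.

```python
import zlib

DEAD_AFTER_S = 10 * 60  # timeout stated in the text: 10 minutes


class Datanode:
    """Hypothetical datanode that keeps a checksum next to every block."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> (data, checksum stored at write time)

    def store_block(self, block_id, data):
        self.blocks[block_id] = (data, zlib.crc32(data))

    def block_report(self):
        """Report only blocks whose data still matches the stored checksum."""
        return [
            block_id
            for block_id, (data, checksum) in self.blocks.items()
            if zlib.crc32(data) == checksum
        ]


class Namenode:
    """Hypothetical namenode tracking heartbeats and block locations."""

    def __init__(self):
        self.last_heartbeat = {}   # datanode name -> time of last heartbeat
        self.block_locations = {}  # block_id -> set of datanode names

    def heartbeat(self, node_name, now):
        self.last_heartbeat[node_name] = now

    def receive_block_report(self, node_name, block_ids):
        # Forget what this node held before, then record what it reports now.
        for holders in self.block_locations.values():
            holders.discard(node_name)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(node_name)

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return [
            name for name, seen in self.last_heartbeat.items()
            if now - seen > DEAD_AFTER_S
        ]


if __name__ == "__main__":
    namenode = Namenode()
    datanode = Datanode("dn1")
    datanode.store_block("blk_1", b"hello")
    datanode.store_block("blk_2", b"world")
    # Simulate disk-level corruption: data changes, stored checksum does not.
    datanode.blocks["blk_2"] = (b"w0rld", datanode.blocks["blk_2"][1])
    namenode.heartbeat("dn1", now=0)
    namenode.receive_block_report("dn1", datanode.block_report())
    print(namenode.block_locations)
    print(namenode.dead_nodes(now=601))
```

Because the corrupted block is left out of the report, the Namenode no longer lists the datanode as a holder of that block, which is how it learns that a replica is missing.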

Hive:

Hive is used as a substitute for SQL on the Hadoop file system, which makes it easy for people acquainted with SQL to query data from it instead of having to learn MapReduce.

Spark and Shark Stack:

Spark:

• Apache Spark is an open-source data analytics cluster computing framework. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS).

• Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

• It follows the concept of a Resilient Distributed Dataset (RDD), which allows data to be stored transparently in memory and persisted to disk if needed.

• It provides the machine learning library MLlib for data analytics on the fly.

Shark:

• Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

• It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries.

• Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments.

• Shark supports Hive UDFs, SerDe and every major functionality in Hive, some of which are not supported in similar query engines such as Impala.

• Spark is written in Scala and has its own version of the Scala interpreter, so it is preferable to use Scala.

MongoDB:

MongoDB is an open-source, document-oriented, NoSQL database system. In the edX platform, the discussion forum and some course-related material are stored as collections of JSON-like documents in a MongoDB database. We used a local MongoDB database system to store the mongo dump files retrieved from the edX data package.

MySQL:

MySQL is a relational database management system (RDBMS), and ships with no GUI tools to administer MySQL databases or manage the data contained within the databases. The data from a MySQL dump can be processed by importing it into HDFS using Sqoop.

Sqoop:

Sqoop is a tool designed to transfer data between Hadoop and relational databases. We used Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS). Sqoop automates most of the process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
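As a rough illustration of the Sqoop import step, the sketch below builds the argument list for a `sqoop import` command in Python without running it. The helper name `build_sqoop_import`, the JDBC URL, the user name, the table name and all paths are hypothetical placeholders, not values used in the project. The flags themselves (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import arguments.

```python
def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, num_mappers=4):
    """Return the argv list for importing one RDBMS table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # table whose schema Sqoop reads
        "--target-dir", target_dir,        # HDFS directory for the output
        "--num-mappers", str(num_mappers), # parallel map tasks for the import
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    command = build_sqoop_import(
        jdbc_url="jdbc:mysql://localhost/example_db",
        username="example_user",
        password_file="/user/example/.password",
        table="example_table",
        target_dir="/data/example_table",
    )
    print(" ".join(command))
    # To actually run it on a machine with Sqoop installed, one could use:
    # import subprocess; subprocess.run(command, check=True)
```

The number of mappers controls how many parallel MapReduce tasks share the import, which is where the parallelism mentioned above comes from.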

Merits of using Python over R:

Python is a high-level language. The language provides constructs intended to enable clear programs on both a small and large scale. R is a free software programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software.

• The main advantage of Python over R is that it is a general-purpose programming language. It scales easily, so it is conceivable that anything you have in your sandbox can be used in production.

• The JDBC connector for Python is much faster than the RHive connector. Thus, Python takes less time than R to return data from Hive.

• Also, Python has Cython, a compiled superset of the language used for improving computational speed. Cython allows one to statically type variables, e.g. cdef int i declares i to be an integer. This can give large speedups, as typed variables are then treated as low-level C types rather than Python objects.

Google Charts:

Google Charts provides a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides several ready-to-use chart types.

The most common way to use Google Charts is with simple JavaScript that you embed in your web page. You load some Google Chart libraries, list the data to be charted, select options to customize your chart, and finally create a chart object with an id that you choose. Then, later in the web page, you create a <div> with that id to display the Google Chart.

Charts are exposed as JavaScript classes, and Google Charts provides many chart types for you to use. The default appearance will usually be all you need, and you can always customize a chart to fit the look and feel of your website. Charts are highly interactive and expose events that let you connect them to create complex dashboards or other experiences integrated with your web page. Charts are rendered using HTML5/SVG technology to provide cross-browser compatibility (including VML for older IE versions) and cross-platform portability to iPhones, iPads and Android. Your users will never have to mess with plugins or any software. If they have a web browser, they can see your charts.

All chart types are populated with data using the DataTable class, making it easy to switch between chart types as you experiment to find the ideal appearance. The DataTable provides methods for sorting, modifying, and filtering data, and can be populated directly from your web page, a database, or any data provider supporting the Chart Tools Data source protocol. (That protocol includes a SQL-like query language and is implemented by Google Spreadsheets, Google Fusion Tables, and third-party data providers such as Salesforce. You can even implement the protocol on your own website and become a data provider for other services.)

Combined working:

Data Loading:

On choosing this option, the client gets a form on which to fill in all the details required to upload, update or reset HDFS. On clicking the submit button, the data is sent through a Django server to a function which starts the process. First, it is checked whether some other process is already running.

If not, the current process starts by passing the required information to a Python script which, depending on the inputs, starts with cleaning the data stored in the path given by the user, followed by loading it onto HDFS. Before cleaning starts, it is also verified whether the given path contains appropriate data for processing. If the data found is not proper, a message with the details is sent from the script to the Django views, which is then shown to the user through a web interface, asking for correct details.

If a process was already running, the latest instruction cannot be performed, as this can leave HDFS in an inconsistent state. Also, a user instruction must not be lost if it is not executed immediately. To solve this issue, a queuing mechanism is implemented using django-rq.

Django-rq is a queuing mechanism based on Redis and RQ. It allows us to create multiple queues (defined in the settings.py file), any of which can be used, as per the programmer's choice, without affecting the other queues, and to easily put jobs into them. Jobs kept in a queue are executed as and when the previous job is completed. Until then, they are stored in Redis, waiting for their turn. There are workers which read the enqueued jobs and run them. These RQ workers need to be active when a request is made; otherwise the processes will not be able to execute.
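The queuing behaviour described above can be sketched with Python's standard library alone. This is a stand-in for illustration, not django-rq itself: it swaps Redis for an in-process `queue.Queue` and the RQ worker for a single background thread. The `JobQueue` class and the `load_to_hdfs` function are hypothetical names. The point it shows is that jobs are never dropped and never run concurrently; each waits until the previous one has finished.

```python
import queue
import threading


class JobQueue:
    """FIFO queue with a single worker, so jobs never run concurrently."""

    def __init__(self):
        self._jobs = queue.Queue()
        self.results = []
        # The worker must be running for enqueued jobs to execute.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, func, *args):
        """Store the job; it runs once every earlier job has finished."""
        self._jobs.put((func, args))

    def _run(self):
        while True:
            func, args = self._jobs.get()
            try:
                self.results.append(func(*args))
            except Exception as exc:  # keep the worker alive for later jobs
                self.results.append(exc)
            finally:
                self._jobs.task_done()

    def wait(self):
        """Block until every enqueued job has been processed."""
        self._jobs.join()


def load_to_hdfs(path):
    # Hypothetical stand-in for the cleaning-and-loading script.
    return f"loaded {path}"


if __name__ == "__main__":
    jobs = JobQueue()
    jobs.enqueue(load_to_hdfs, "/data/first")
    jobs.enqueue(load_to_hdfs, "/data/second")
    jobs.wait()
    print(jobs.results)
```

Using one worker per queue is what serializes access to HDFS in this sketch. With django-rq, the same effect depends on how many workers are started for the queue.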

ETL Process:
