Big Data and Analytics: Hadoop Ecosystem. Dr. Abzetdin Adamov, School of Information Technology and Engineering, ADA University, http://site.ada.qu.edu.az/~aadamov


Post on 15-Mar-2020


Page 1: Big Data and Analytics - ADA Universityaadamov/sources/slides/bigdata/week-4-BDA-Hadoop-Ecosystem.pdfHadoop 2.0 vs Hadoop 1.0 – Processing The Hadoop Ecosystem Hadoop. Hortonworks

Big Data and Analytics
Hadoop Ecosystem

Dr. Abzetdin Adamov
School of Information Technology and Engineering
ADA University
http://site.ada.qu.edu.az/~aadamov


Previously Covered Topics

• Key differences of Traditional and Big Data Architecture
• Transferring Computation Power against Transferring Data
• Schema on Read vs Schema on Write
• Hadoop Core – Storage: HDFS Architecture
• Hadoop Core – Processing: MapReduce Architecture
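As a reminder of the "Schema on Read vs Schema on Write" topic listed above, here is a minimal Python sketch. The record layout, field names, and helper functions are assumptions made up for illustration; they do not come from the slides or from any Hadoop API.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no schema enforced).
RAW_LINES = [
    '{"user": "a", "clicks": "3"}',
    '{"user": "b", "clicks": 5, "country": "AZ"}',
]

def write_with_schema(record):
    """Schema on write: coerce to a fixed shape BEFORE storing the record."""
    return {"user": str(record["user"]), "clicks": int(record["clicks"])}

def read_with_schema(line):
    """Schema on read: keep raw text, apply structure only at query time."""
    record = json.loads(line)
    return {"user": str(record["user"]), "clicks": int(record.get("clicks", 0))}

# Schema on write: the stored table already has a fixed shape,
# and the extra "country" field is dropped at load time.
table = [write_with_schema(json.loads(line)) for line in RAW_LINES]

# Schema on read: raw lines stay untouched; each query decides how to read them.
total_clicks = sum(read_with_schema(line)["clicks"] for line in RAW_LINES)
```

With schema on read, a later query could still use the "country" field, because the raw lines were never reshaped.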


Objectives

• Vagrant + Provisioning + VirtualBox = Repeatable Multi VMs
• Hadoop 2.0 vs Hadoop 1.0
• Hadoop Ecosystem Components Classification
• Hadoop Ecosystem Components Key Features


Hadoop Ecosystem Components


Companies building on top of Hadoop

• Amazon Web Services
• Cloudera
• Hortonworks
• IBM
• Intel
• MapR Technologies
• Microsoft
• Pivotal Software
• Teradata


Powered by Apache Hadoop

• https://wiki.apache.org/hadoop/PoweredBy

• Thousands of companies and organizations run Hadoop clusters ranging in size from several nodes to tens of thousands of nodes (40,000 at Yahoo)


Hadoop Core = Storage + Compute

• Compute (CPU, RAM): Yet Another Resource Negotiator (YARN)
• Storage: Hadoop Distributed File System (HDFS)


Hadoop 2.0 vs Hadoop 1.0


Hadoop 1.0 Bottlenecks: HDFS / MapReduce


Hadoop 2.0 Architecture


YARN/MRv2 vs MRv1 Architecture


Hadoop 2.0 vs Hadoop 1.0 – Processing


The Hadoop Ecosystem


Hortonworks Hadoop Distribution


Classification of Hadoop Ecosystem Components

Diagram labels (categories): Administration and Server Coordination; Data Management; Workflow Engine; Distributed Storage; Resource Management; Processing Framework; API; Analytics

Diagram labels (components): Hue, Ambari, ZooKeeper, Flume, Sqoop, Oozie, Avro, HDFS, YARN, MapReduce, MapReduce v2, Tez, Hoya, Pig, HBase, Hive, Mahout


Classification of Hadoop Ecosystem Components


Hadoop Ecosystem Components


Data Management Frameworks

Framework: Description

• Hadoop Distributed File System (HDFS): A Java-based, distributed file system that provides scalable, reliable, high-throughput access to application data stored across commodity servers
• Yet Another Resource Negotiator (YARN): A framework for cluster resource management and job scheduling
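To make the HDFS description more concrete, the sketch below estimates how many blocks a file occupies and how much raw disk its replicas consume. It assumes a 128 MB block size and a replication factor of 3, which are the usual Hadoop 2 defaults but are configurable per cluster; the function name is hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # assumed default block size (configurable per cluster)
REPLICATION = 3       # assumed default replication factor (configurable)

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, raw storage in MB across all replicas)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies as much disk as the data it holds,
    # so raw usage is the file size multiplied by the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

blocks, raw_mb = hdfs_footprint(1000)
```

Under these assumptions a 1000 MB file is split into 8 blocks and takes 3000 MB of raw cluster storage.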


Operations Frameworks

Framework: Description

• Ambari: A Web-based framework for provisioning, managing, and monitoring Hadoop clusters
• ZooKeeper: A high-performance coordination service for distributed applications
• Cloudbreak: A tool for provisioning and managing Hadoop clusters in the cloud
• Oozie: A server-based workflow engine used to execute Hadoop jobs


Ambari Web UI (REST)


Data Access Frameworks

Framework: Description

• Pig: A high-level platform for extracting, transforming, or analyzing large datasets
• Hive: A data warehouse infrastructure that supports ad hoc SQL queries
• HCatalog: A table information, schema, and metadata management layer supporting Hive, Pig, MapReduce, and Tez processing
• Cascading: An application development framework for building data applications, abstracting the details of complex MapReduce programming
• HBase: A scalable, distributed NoSQL database that supports structured data storage for large tables
• Phoenix: A client-side SQL layer over HBase that provides low-latency access to HBase data
• Accumulo: A low-latency, large table data storage and retrieval system with cell-level security
• Storm: A distributed computation system for processing continuous streams of real-time data
• Solr: A distributed search platform capable of indexing petabytes of data
• Spark: A fast, general-purpose processing engine used to build and run sophisticated SQL, streaming, machine learning, or graph applications


Governance and Integration Frameworks

Framework: Description

• Falcon: A data governance tool providing workflow orchestration, data lifecycle management, and data replication services
• WebHDFS: A REST API that uses the standard HTTP verbs to access, operate, and manage HDFS
• HDFS NFS Gateway: A gateway that enables access to HDFS as an NFS-mounted file system
• Flume: A distributed, reliable, and highly available service that efficiently collects, aggregates, and moves streaming data
• Sqoop: A set of tools for importing and exporting data between Hadoop and relational database management systems (RDBMS)
• Kafka: A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system
• Atlas: A scalable and extensible set of core governance services enabling enterprises to meet compliance and data integration requirements
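The WebHDFS entry above mentions standard HTTP verbs. As a rough sketch of what that means, the code below builds the verb and URL for a few WebHDFS operations without sending any request. The host name, port, and path are placeholders chosen for illustration (50070 is only a common Hadoop 2 NameNode HTTP port), and the helper function is hypothetical, not part of any Hadoop client library.

```python
from urllib.parse import urlencode, quote

# HTTP verb used by a few WebHDFS operations.
WEBHDFS_VERBS = {
    "LISTSTATUS": "GET",   # list a directory
    "OPEN": "GET",         # read a file
    "MKDIRS": "PUT",       # create a directory
    "DELETE": "DELETE",    # remove a file or directory
}

def webhdfs_request(host, port, path, op, **params):
    """Build the (verb, url) pair for a WebHDFS call without sending it."""
    query = urlencode({"op": op, **params})
    url = f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
    return WEBHDFS_VERBS[op], url

# Placeholder host and path, for illustration only.
verb, url = webhdfs_request("namenode.example.com", 50070, "/user/demo", "LISTSTATUS")
```

The resulting pair could be handed to any HTTP client; authentication and redirects to DataNodes are left out of this sketch.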


Security Frameworks

Framework: Description

• HDFS: A storage management service providing file and directory permissions, even more granular file and directory access control lists, and transparent data encryption
• YARN: A resource management service with access control lists controlling access to compute resources and YARN administrative functions
• Hive: A data warehouse infrastructure service providing granular access controls to table columns and rows
• Falcon: A data governance tool providing access control lists that limit who may submit Hadoop jobs
• Knox: A gateway providing perimeter security to a Hadoop cluster
• Ranger: A centralized security framework offering fine-grained policy controls for HDFS, Hive, HBase, Knox, Storm, Kafka, and Solr


Ecosystem Component Versions


Hadoop Ecosystem Components’ Key Features


HADOOP ECOSYSTEM COMPONENTS

It is important to understand the components of the Hadoop Ecosystem in order to build the right solution for a given business problem.


Classification of the Hadoop Ecosystem Components

Hadoop is a straightforward answer for processing Big Data.

The Hadoop Ecosystem combines technologies that offer a distinct advantage in solving data-oriented business problems.


CORE HADOOP

Hadoop Distributed File System (HDFS)
Stands for: managing big data sets with high Volume, Velocity and Variety.

MapReduce
Stands for: processing high-volume distributed data.

Yet Another Resource Negotiator (YARN)
Stands for: resource management, job scheduling and monitoring.
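The MapReduce idea named above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps using the classic word-count example; it does not use the Hadoop API, and all function names are made up for this sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce over the input lines in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```

On a real cluster the mappers and reducers run in parallel on different nodes, close to the HDFS blocks they read; here everything runs sequentially to keep the data flow visible.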


DATA ACCESS

Apache Pig
Stands for: high-level language built on top of MapReduce for analyzing large data sets and for data flow.

Apache Hive
Stands for: high-level query language and data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis.


DATA STORAGE

Apache HBase
Stands for: NoSQL database built for hosting large tables with billions of rows and millions of columns on top of Hadoop.

Cassandra
Stands for: NoSQL database based on key-value model designed for linear scalability and high availability.
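To give a feel for how a table with "billions of rows and millions of columns" can be addressed, here is a toy in-memory model of a sparse, versioned table where each cell is located by row key, column, and timestamp. The class and its methods are hypothetical and only mimic the logical data model; they are not the HBase or Cassandra client API.

```python
from collections import defaultdict

class TinyTable:
    """Toy in-memory model of a sparse, versioned wide-column table."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell; older versions are kept."""
        self._rows[row][column].append((timestamp, value))

    def get(self, row, column):
        """Return the value with the highest timestamp, or None if absent."""
        versions = self._rows.get(row, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

table = TinyTable()
table.put("user1", "info:name", "Ann", timestamp=1)
table.put("user1", "info:name", "Anna", timestamp=2)
latest = table.get("user1", "info:name")
```

Only cells that were actually written take up space, which is why very wide tables remain practical in this model.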


INTERACTION-VISUALIZATION-DEVELOPMENT

HCatalog
Stands for: providing integration of Hive metadata for other Hadoop applications like Pig, MapReduce and others.

Lucene
Stands for: high-performance, full-featured text search engine library written entirely in Java.

Hama
Stands for: distributed framework based on Bulk Synchronous Parallel (BSP) computing for massive scientific computations like matrix, graph and network algorithms.

Crunch
Stands for: writing, testing and running MapReduce pipelines.


DATA INTELLIGENCE

Apache Drill
Stands for: low-latency SQL query engine for Hadoop and NoSQL.

Apache Mahout
Stands for: scalable machine learning library designed for building predictive analytics on Big Data. Mahout now has implementations on Apache Spark for faster in-memory computing.


DATA INTEGRATION

Apache Sqoop
Stands for: importing and exporting data between Hadoop and relational database management systems (RDBMS).

Apache Flume
Stands for: distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Apache Chukwa
Stands for: scalable log collector used for monitoring large distributed file systems.


MANAGEMENT, MONITORING and ORCHESTRATION

Apache Ambari
Stands for: simplifying Hadoop management by providing an interface for provisioning, managing and monitoring Apache Hadoop clusters.

Apache ZooKeeper
Stands for: maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Apache Oozie
Stands for: scheduling workflows to manage Apache Hadoop jobs.
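Workflow scheduling of the kind Oozie provides can be pictured as ordering a directed acyclic graph of actions so that each action runs only after the actions it depends on. The sketch below does this in plain Python; the action names and the dictionary format are invented for illustration and are not Oozie's XML workflow definition.

```python
from collections import deque

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_with_sqoop": set(),
    "clean_with_pig": {"import_with_sqoop"},
    "aggregate_with_hive": {"clean_with_pig"},
    "export_report": {"aggregate_with_hive"},
}

def execution_order(dag):
    """Return one valid run order for the actions (Kahn's algorithm)."""
    remaining = {action: set(deps) for action, deps in dag.items()}
    ready = deque(sorted(a for a, deps in remaining.items() if not deps))
    order = []
    while ready:
        action = ready.popleft()
        order.append(action)
        # Release every action whose last unmet dependency just finished.
        for other, deps in sorted(remaining.items()):
            if action in deps:
                deps.remove(action)
                if not deps:
                    ready.append(other)
    if len(order) != len(dag):
        raise ValueError("workflow contains a cycle")
    return order

order = execution_order(workflow)
```

A real workflow engine would also handle retries, timeouts, and failure transitions; this sketch covers only the ordering step.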


Where Can We Use Machine Learning (Data Science)

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels


YARN as a Data Operating System

Applications Run Natively IN Hadoop

• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• EXISTING (Slider)
• SEARCH (Solr)

YARN (Cluster Resource Management)

HDFS2 (Redundant, Reliable Storage)

Applications now run “in” Hadoop, instead of “on” Hadoop.


Modern Data Applications approach to Insights

Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis; test against selected data
• Analyze after landing…

Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way; explore all data, identify correlations
• Analyze in motion…


Q&A?

Abzetdin Adamov, Assoc. Prof.
Email me at: [email protected]:@
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com