big data and analytics - ada...

Big Data and Analytics: Hadoop Ecosystem

Dr. Abzetdin Adamov, School of Information Technology and Engineering

ADA University, http://site.ada.qu.edu.az/~aadamov

Previously Covered Topics

• Key differences between Traditional and Big Data Architecture
• Transferring Computation Power vs Transferring Data
• Schema on Read vs Schema on Write
• Hadoop Core – Storage: HDFS Architecture
• Hadoop Core – Processing: MapReduce Architecture

Objectives

• Vagrant + Provisioning + VirtualBox = Repeatable Multi-VMs
• Hadoop 2.0 vs Hadoop 1.0
• Hadoop Ecosystem Components Classification
• Hadoop Ecosystem Components Key Features

Hadoop Ecosystem Components

Companies building on top of Hadoop

• Amazon Web Services
• Cloudera
• Hortonworks
• IBM
• Intel
• MapR Technologies
• Microsoft
• Pivotal Software
• Teradata

Powered by Apache Hadoop

• https://wiki.apache.org/hadoop/PoweredBy

• Thousands of companies and organizations run Hadoop clusters, ranging in size from several nodes to tens of thousands (40,000 at Yahoo)

Hadoop Core = Storage + Compute

• Compute (CPU, RAM): Yet Another Resource Negotiator (YARN)
• Storage: Hadoop Distributed File System (HDFS)

Hadoop 2.0 vs Hadoop 1.0

Hadoop 1.0 Bottlenecks: HDFS / MapReduce

Hadoop 2.0 Architecture

YARN/MRv2 vs MRv1 Architecture

Hadoop 2.0 vs Hadoop 1.0 – Processing

The Hadoop Ecosystem

Hadoop

Hortonworks Hadoop Distribution

Classification of Hadoop Ecosystem Components

[Figure: classification diagram. Categories shown: Administration and Server Coordination, Distributed Storage, Resource Management, Processing Framework, Analytics, Data Management, Workflow Engine. Components shown: Hue, Ambari, ZooKeeper, Flume, Sqoop, Oozie, Avro, MapReduce, MapReduce v2, Mahout, Pig, HBase, Tez, Hoya.]

Hadoop Ecosystem Components

Data Management Frameworks

Framework: Description

• Hadoop Distributed File System (HDFS): A Java-based, distributed file system that provides scalable, reliable, high-throughput access to application data stored across commodity servers
• Yet Another Resource Negotiator (YARN): A framework for cluster resource management and job scheduling

Operations Frameworks

Framework: Description

• Ambari: A Web-based framework for provisioning, managing, and monitoring Hadoop clusters
• ZooKeeper: A high-performance coordination service for distributed applications
• Cloudbreak: A tool for provisioning and managing Hadoop clusters in the cloud
• Oozie: A server-based workflow engine used to execute Hadoop jobs

Ambari Web UI (REST)

Data Access Frameworks

Framework: Description

• Pig: A high-level platform for extracting, transforming, or analyzing large datasets
• Hive: A data warehouse infrastructure that supports ad hoc SQL queries
• HCatalog: A table information, schema, and metadata management layer supporting Hive, Pig, MapReduce, and Tez processing
• Cascading: An application development framework for building data applications, abstracting the details of complex MapReduce programming
• HBase: A scalable, distributed NoSQL database that supports structured data storage for large tables
• Phoenix: A client-side SQL layer over HBase that provides low-latency access to HBase data
• Accumulo: A low-latency, large-table data storage and retrieval system with cell-level security
• Storm: A distributed computation system for processing continuous streams of real-time data
• Solr: A distributed search platform capable of indexing petabytes of data
• Spark: A fast, general-purpose processing engine used to build and run sophisticated SQL, streaming, machine learning, or graph applications

Governance and Integration Frameworks

Framework: Description

• Falcon: A data governance tool providing workflow orchestration, data lifecycle management, and data replication services
• WebHDFS: A REST API that uses the standard HTTP verbs to access, operate, and manage HDFS
• HDFS NFS Gateway: A gateway that enables access to HDFS as an NFS-mounted file system
• Flume: A distributed, reliable, and highly available service that efficiently collects, aggregates, and moves streaming data
• Sqoop: A set of tools for importing and exporting data between Hadoop and RDBMS systems
• Kafka: A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system
• Atlas: A scalable and extensible set of core governance services enabling enterprises to meet compliance and data integration requirements
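The WebHDFS entry above says the API maps HDFS operations onto standard HTTP verbs. A minimal sketch of that mapping follows. It only builds the verb and URL and sends nothing over the network. The host name, port, and paths are hypothetical placeholders, and only a handful of operations are listed.

```python
from urllib.parse import quote, urlencode

# HTTP verb used by a few common WebHDFS operations (not an exhaustive list).
WEBHDFS_VERBS = {
    "OPEN": "GET",
    "LISTSTATUS": "GET",
    "GETFILESTATUS": "GET",
    "MKDIRS": "PUT",
    "CREATE": "PUT",
    "APPEND": "POST",
    "DELETE": "DELETE",
}


def webhdfs_request(host, port, path, op, **params):
    """Return (http_verb, url) for one WebHDFS operation; nothing is sent.

    `path` is an absolute HDFS path starting with "/".
    `host` and `port` are whatever the cluster's NameNode exposes (assumed here).
    """
    op = op.upper()
    verb = WEBHDFS_VERBS[op]
    query = urlencode({"op": op, **params})
    url = f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
    return verb, url


if __name__ == "__main__":
    # Hypothetical NameNode address, for illustration only.
    print(webhdfs_request("namenode.example.com", 50070, "/user/demo", "LISTSTATUS"))
```

The returned pair can be handed to any HTTP client. Keeping URL construction separate from sending makes the verb-to-operation mapping easy to inspect.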

Security Frameworks

Framework: Description

• HDFS: A storage management service providing file and directory permissions, even more granular file and directory access control lists, and transparent data encryption
• YARN: A resource management service with access control lists controlling access to compute resources and YARN administrative functions
• Hive: A data warehouse infrastructure service providing granular access controls to table columns and rows
• Falcon: A data governance tool providing access control lists that limit who may submit Hadoop jobs
• Knox: A gateway providing perimeter security to a Hadoop cluster
• Ranger: A centralized security framework offering fine-grained policy controls for HDFS, Hive, HBase, Knox, Storm, Kafka, and Solr

Ecosystem Component Versions

Hadoop Ecosystem Components’ Key Features

HADOOP ECOSYSTEM COMPONENTS

It's important to understand the components of the Hadoop ecosystem in order to build the right solution for a given business problem.

Classification of the Hadoop Ecosystem Components

Hadoop is a straightforward answer to processing Big Data.

The Hadoop ecosystem combines technologies that offer a distinct advantage in solving data-oriented business problems.

CORE HADOOP

Hadoop Distributed File System (HDFS) stands for: managing big data sets with high Volume, Velocity, and Variety.

MapReduce stands for: processing high-volume distributed data.

Yet Another Resource Negotiator (YARN) stands for: resource management, job scheduling, and monitoring.
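To make the MapReduce entry concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using the classic word-count example. It illustrates the programming model only. It does not use the Hadoop API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big storage", "data moves to compute"]
    print(word_count(sample))
```

On a real cluster the mappers and reducers run in parallel on the nodes that hold the data, and the shuffle moves intermediate pairs across the network. Here everything runs in one process for clarity.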

DATA ACCESS

Apache Pig stands for: a high-level language built on top of MapReduce for analyzing large data sets and for data flow.

Apache Hive stands for: a high-level query language and data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

DATA STORAGE

Apache HBase stands for: a NoSQL database built on top of Hadoop for hosting large tables with billions of rows and millions of columns.

Cassandra stands for: a NoSQL database based on a key-value model, designed for linear scalability and high availability.
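A table with billions of rows and millions of columns is only practical if it is sparse, with each row storing just the cells that were actually written. The toy class below sketches that idea: a row key maps to "family:qualifier" columns, and each cell keeps timestamped versions. It is an in-memory illustration with hypothetical names, not the HBase or Cassandra client API.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory stand-in for a wide, sparse table (illustration only)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        """Write one cell version; columns that are never written take no space."""
        self._rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the value with the highest timestamp, or None if the cell is absent."""
        if row_key not in self._rows or column not in self._rows[row_key]:
            return None
        versions = self._rows[row_key][column]
        return max(versions, key=lambda version: version[0])[1]

    def columns(self, row_key):
        """List the columns actually stored for a row, in sorted order."""
        return sorted(self._rows.get(row_key, {}))


if __name__ == "__main__":
    table = SparseTable()
    table.put("user1", "info:name", "Ada", timestamp=1)
    table.put("user1", "info:name", "Ada L.", timestamp=2)
    table.put("user2", "stats:visits", 7, timestamp=1)
    print(table.get("user1", "info:name"), table.columns("user2"))
```

Different rows can carry entirely different column sets, which is what lets the logical table be very wide without storing empty cells.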

INTERACTION-VISUALIZATION-DEVELOPMENT

HCatalog stands for: providing integration of Hive metadata for other Hadoop applications such as Pig, MapReduce, and others.

Lucene stands for: a high-performance, full-featured text search engine library written entirely in Java.

Hama stands for: a distributed framework based on Bulk Synchronous Parallel (BSP) computing for massive scientific computations such as matrix, graph, and network algorithms.

Crunch stands for: writing, testing, and running MapReduce pipelines.

DATA INTELLIGENCE

Apache Drill stands for: a low-latency SQL query engine for Hadoop and NoSQL.

Apache Mahout stands for: a scalable machine learning library designed for building predictive analytics on Big Data. Mahout now has implementations on Apache Spark for faster in-memory computing.

DATA INTEGRATION

Apache Sqoop stands for: a set of tools for importing and exporting data between Hadoop and relational database (RDBMS) systems.

Apache Flume stands for: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Apache Chukwa stands for: a scalable log collector used for monitoring large distributed file systems.

MANAGEMENT, MONITORING and ORCHESTRATION

Apache Ambari stands for: simplifying Hadoop management by providing an interface for provisioning, managing, and monitoring Apache Hadoop clusters.

Apache ZooKeeper stands for: maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Apache Oozie stands for: scheduling workflows to manage Apache Hadoop jobs.
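A workflow engine such as Oozie runs jobs in an order that respects their dependencies, with the workflow forming a directed acyclic graph of actions. The sketch below shows only that ordering idea, using Python's standard `graphlib` module (Python 3.9+). Real Oozie workflows are defined in XML and executed by the Oozie server. The job names here are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return one valid run order for {job: set of jobs it depends on}.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    since a workflow with a cycle can never finish.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical workflow: two imports feed a join, which feeds a report.
    workflow = {
        "import_logs": set(),
        "import_users": set(),
        "join": {"import_logs", "import_users"},
        "report": {"join"},
    }
    print(execution_order(workflow))
```

Jobs with no dependency between them, such as the two imports here, could run in parallel. The sketch just returns a single sequential order that is guaranteed to be valid.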

Where Can We Use Machine Learning (Data Science)

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

YARN as a Data Operating System

Applications Run Natively IN Hadoop

HDFS2 (Redundant, Reliable Storage)

YARN (Cluster Resource Management)

• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• EXISTING (Slider)
• SEARCH (Solr)

Applications now run “in” Hadoop, instead of “on” Hadoop.

Modern Data Applications: Approach to Insights

Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis, test against selected data
• Analyze after landing…

Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way: explore all data, identify correlations
• Analyze in motion…

Q&A?

Abzetdin Adamov, Assoc. Prof.
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com
