Hadoop Ecosystem Overview

CMSC 491 Hadoop-Based Distributed Computing

Spring 2016, Adam Shook



Agenda

•  Introduce Hadoop projects to prepare you for your group work
  –  Intimate detail will be provided in future lectures
•  Discuss potential use cases for each project

Topics

•  HDFS
•  MapReduce
•  YARN
•  Sqoop
•  Flume
•  NiFi
•  Pig
•  Hive
•  Streaming
•  HBase
•  Accumulo
•  Avro
•  Parquet
•  Mahout
•  Oozie
•  Storm
•  ZooKeeper
•  Spark
•  SQL-on-Hadoop
•  In-Memory Stores
•  Cassandra
•  Kafka
•  Crunch
•  Azkaban

HDFS

•  Hadoop Distributed File System
  –  High-performance file system for storing data
•  We’ve talked about this enough

Hadoop MapReduce

•  High-performance, fault-tolerant data processing system
•  We’ve also talked about this enough

YARN

•  Abstract framework for distributed application development
•  Split functionality of JobTracker into two components
  –  ResourceManager
  –  ApplicationMaster
•  TaskTracker becomes NodeManager
  –  Containers instead of map and reduce slots
•  Configurable amount of memory per NodeManager
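
To make "containers instead of slots" concrete, here is a rough sketch of how many equally sized containers might fit on one NodeManager. The property names `yarn.nodemanager.resource.memory-mb` and `yarn.scheduler.minimum-allocation-mb` are real YARN settings, but the function, its default values, and the rounding rule are simplifying assumptions for illustration. CPU (vcores) and scheduler-specific behavior are ignored.

```python
import math


def containers_per_node(node_memory_mb, request_mb, min_allocation_mb=1024):
    """Estimate how many containers of one size fit on a NodeManager.

    node_memory_mb    -- stands in for yarn.nodemanager.resource.memory-mb
    min_allocation_mb -- stands in for yarn.scheduler.minimum-allocation-mb
    """
    # Assumption: the scheduler rounds each request up to a multiple of
    # the minimum allocation before granting it.
    granted_mb = math.ceil(request_mb / min_allocation_mb) * min_allocation_mb
    return node_memory_mb // granted_mb, granted_mb


count, granted = containers_per_node(node_memory_mb=8192, request_mb=1536)
print(f"{count} containers of {granted} MB each")
```

Unlike fixed map and reduce slots, the same memory pool can be carved into a few large containers or many small ones, depending on what each application requests.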

MapReduce 2.x on YARN

•  MapReduce API has not changed
  –  Binary-level backwards compatible (no recompile)
•  ApplicationMaster launches and monitors job via YARN
•  MapReduce History Server to store… history
•  Enabled Yahoo! to scale beyond 4,000 nodes

Hadoop Ecosystem

•  Core Technologies
  –  Hadoop Distributed File System
  –  Hadoop MapReduce
•  Many other tools…
  –  Which we will be discussing… now

Apache Sqoop

•  Apache project designed for efficient transfer between Apache Hadoop and structured data stores
•  Used through a CLI, and extensible
•  Use cases?

Apache Flume

•  Distributed, reliable, available service for collecting, aggregating, and moving large amounts of log data
•  Agents are configured using simple files, and it is extensible
•  Use cases?

Apache NiFi

•  A service to reliably move and manipulate files between clusters using a web front-end
•  Uses a GUI to drop processors and connect them to build workflows
•  Use cases?

Apache Pig

•  Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
•  Infrastructure compiles language to a sequence of MapReduce programs
•  Use cases?

Apache Hive

•  Data warehouse facilitating querying and managing large datasets
•  Compiles SQL-like queries into MapReduce programs
•  Use cases?
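
As a rough picture of what "compiling a query into MapReduce" means, the sketch below runs a group-by count as explicit map, shuffle, and reduce steps. It is not Hive's actual execution plan. The table name `visits`, the column `page`, and the three helper functions are hypothetical.

```python
from collections import defaultdict

# Query being imitated (hypothetical table and column):
#   SELECT page, COUNT(*) FROM visits GROUP BY page


def map_phase(rows):
    # Emit one (key, 1) pair per input row; the GROUP BY column is the key.
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    # Group values by key and sort the keys, as the framework would
    # do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    # COUNT(*) becomes a sum over the 1s collected for each key.
    return {key: sum(values) for key, values in groups.items()}


visits = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(visits)))
print(counts)
```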

Hadoop Streaming

•  Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer
•  Just a jar file, not a real project
•  Use cases?
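
A minimal word-count sketch shows what "any executable or script" can look like in practice. Streaming passes records to the script on standard input and reads tab-separated key/value lines from standard output. The single-file layout and the `map`/`reduce` command-line argument are assumptions made for this example, not something Streaming requires.

```python
import sys
from itertools import groupby


def mapper(lines):
    # Emit "word<TAB>1" for every word on every input line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    # Input arrives sorted by key, so equal words are adjacent
    # and can be summed with a single pass.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical command-line wrapper: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The script would be handed to the streaming jar through its `-mapper` and `-reducer` options, together with `-input` and `-output` paths. Locally, piping a file through the map step, `sort`, and the reduce step approximates the same flow.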

Which high-level API is for you?

•  What are you comfortable with?
•  What are you being told to use?

Apache HBase

•  Distributed, scalable, big data store
•  Data stored as sorted key/value pairs, with the key consisting of a row and column
•  Use cases?
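
The "sorted key/value pairs" idea can be sketched with an in-memory store whose keys are (row, column) tuples kept in sorted order, which makes range scans over rows cheap. This is an illustration only. The class `SortedStore` and its methods are hypothetical, and real HBase keys also carry a column family and timestamp and are spread across region servers.

```python
import bisect


class SortedStore:
    """Toy key/value store with keys of the form (row, column)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        # A 1-tuple sorts before every (row, column) key with the same row.
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedStore()
store.put("user2", "info:name", "Bo")
store.put("user1", "info:name", "Al")
store.put("user1", "info:email", "al@example.com")
for key, value in store.scan("user1", "user2"):
    print(key, value)
```

Because all cells for a row sit next to each other in key order, choosing the row key well is the main design decision in this kind of store.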

Apache Accumulo

•  Robust, scalable, high-performance data storage and retrieval key/value store
•  Cell-based access controls
  –  i.e. cell-level security
•  Use cases?
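
Cell-level security means every cell carries its own visibility requirement, and a scan returns only the cells the reader is authorized to see. In Accumulo the requirement is a boolean expression over labels, such as `hr|(admin&audit)`. The sketch below skips the parsing and assumes the expression has already been turned into a list of label sets. The function names and sample data are hypothetical.

```python
def is_visible(visibility, authorizations):
    """Return True if the reader may see a cell.

    visibility     -- list of label sets; holding every label in at
                      least one set is enough (an OR of ANDs).
                      An empty list means the cell is unrestricted.
    authorizations -- set of labels the reader holds.
    """
    if not visibility:
        return True
    return any(required <= authorizations for required in visibility)


def scan(cells, authorizations):
    # Filter at the cell level: two columns of one row can differ.
    return [
        (row, column, value)
        for row, column, visibility, value in cells
        if is_visible(visibility, authorizations)
    ]


cells = [
    ("row1", "name", [], "Alice"),
    # Pre-parsed form of the expression hr|(admin&audit)
    ("row1", "salary", [{"hr"}, {"admin", "audit"}], 90000),
]
print(scan(cells, {"admin"}))
print(scan(cells, {"hr"}))
```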

Apache Avro

•  Data serialization system for the Hadoop ecosystem
•  Use cases?
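
To give a feel for schema-based serialization, this sketch hand-encodes a two-field record following Avro's binary encoding rules: a `long` is written as a zig-zag variable-length integer, a `string` as a length followed by UTF-8 bytes, and a record as its field values in schema order with no field names. It does not use the Avro library. The `User` schema and the helper names are hypothetical, and only these two types are handled.

```python
def encode_long(n):
    # Zig-zag maps small negative and positive numbers to small
    # unsigned ones; the result is then written 7 bits at a time.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Hypothetical schema, written the way Avro schemas are declared (JSON).
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "long"},
    ],
}

ENCODERS = {"string": encode_string, "long": encode_long}


def encode_record(schema, record):
    # Field names are not written; the reader needs the schema to decode.
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


print(encode_record(USER_SCHEMA, {"name": "Al", "age": 30}).hex())
```

Since the bytes carry no field names or type tags, the data stays compact, and the schema has to travel with the data or be known to the reader.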

Apache Parquet

•  Columnar storage format for Hadoop
•  Use cases?
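
"Columnar" refers to storing all values of one column together instead of storing whole rows one after another. The sketch below pivots a few rows into columns to show why a query that touches one column can skip the rest. It says nothing about Parquet's actual file layout, which adds row groups, encodings, and compression. The sample data and `to_columns` are hypothetical.

```python
def to_columns(rows):
    # Pivot a list of row dicts into one list of values per column.
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = [
    {"user": "al", "country": "US", "spend": 12.5},
    {"user": "bo", "country": "DE", "spend": 7.0},
    {"user": "cy", "country": "US", "spend": 3.25},
]

columns = to_columns(rows)

# An aggregate over one column reads only that column's values.
total_spend = sum(columns["spend"])
print(total_spend)
```

Values within a column share a type and often repeat, which is also why columnar layouts tend to compress well.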

Apache Mahout

•  Machine learning library to build scalable machine learning algorithms implemented on top of Hadoop MapReduce
•  Use cases?

Apache Oozie

•  Workflow scheduler system to manage Apache Hadoop jobs
•  Use cases?
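
A workflow scheduler has to run jobs in an order that respects their dependencies. Oozie workflows are defined in XML as a directed acyclic graph of actions. The sketch below only illustrates the ordering problem, using Python's `graphlib` (Python 3.9+). The action names and `run_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions
# that must finish before it can start.
workflow = {
    "import_sales": set(),
    "import_logs": set(),
    "join": {"import_sales", "import_logs"},
    "report": {"join"},
}


def run_order(workflow):
    # Any topological order of the graph is a valid execution order.
    return list(TopologicalSorter(workflow).static_order())


order = run_order(workflow)
print(order)
```

A real scheduler also handles what this leaves out: triggering by time or data availability, retries, and running independent actions in parallel.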

Apache Storm

•  Distributed real-time computation system
•  Didn’t have a logo until June 2014
•  How is this different than MapReduce?
•  Use cases?

Apache ZooKeeper

•  Effort to develop and maintain an open-source server enabling highly reliable distributed coordination
•  Use cases?

Apache Spark

•  Fast and general engine for large-scale data processing
•  Write applications in Java, Scala, or Python
•  Use cases?

SQL on Hadoop

•  Apache Drill, Cloudera Impala, Facebook’s Presto, Hortonworks’s Hive Stinger, Pivotal HAWQ, etc.
•  SQL-like or ANSI SQL compliant MPP execution engines using HDFS as a data store
•  Use cases? Non use cases?

Sample Architecture

[Figure: architecture diagram with components labeled Website, Sales, Call Center, SQL (twice), Flume Agent (three), HDFS, MapReduce, Pig, HBase, Storm, Oozie, and Webserver]

OTHER HADOOP PROJECTS

We [maybe] won’t be covering these in detail later on

Redis, Memcached, etc.

•  Open-source in-memory key/value stores
•  Use cases?

Apache Cassandra

•  NoSQL database for managing large amounts of structured, semi-structured, and unstructured data
•  Support for clusters spanning multiple datacenters
•  Unlike HBase and Accumulo, data is not stored on HDFS
•  Use cases? Non use cases?

Apache Crunch

•  Java framework for writing, testing, and running MapReduce pipelines with a simple API
•  Same code executes as a local job, as a MapReduce job, or as a streaming Spark job
•  Use cases? *

* Not the real logo, but truly fantastic

Apache Kafka

•  High-throughput distributed publish-subscribe message service
•  Use cases?
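
The publish-subscribe model can be sketched in memory: publishers append messages to a named topic, and each consumer group reads the topic independently by tracking its own position. This toy `Broker` class is hypothetical and is not Kafka's client API. Real Kafka also partitions topics and replicates them across brokers, which is omitted here.

```python
from collections import defaultdict


class Broker:
    """Toy single-process message broker with per-group read positions."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only message list
        self._offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        # Messages are only ever appended; the return value is the offset.
        self._logs[topic].append(message)
        return len(self._logs[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        # Each group advances its own offset, so groups do not
        # take messages away from one another.
        start = self._offsets[(group, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = Broker()
for click in ["home", "cart", "checkout"]:
    broker.publish("clicks", click)

print(broker.poll("analytics", "clicks", max_messages=2))
print(broker.poll("billing", "clicks"))
```

Keeping the messages in a log, instead of deleting them on delivery, is what lets a new consumer group start from the beginning without affecting existing ones.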

Azkaban

•  Batch workflow job scheduler to run Hadoop jobs
•  Use cases?

Review

•  A lot of projects are available to you for your group project
•  Think of a problem you are interested in, then choose the appropriate projects to solve it
•  Keep in mind data ingest, storage, processing, and egress
•  Feel free to explore and use projects other than the ones I have listed here
  –  Get permission if you plan on using one as part of your project quota

References

•  All those logos are the property of their owners
•  *.apache.org
•  redis.io