CS 378 – Big Data Programming · dfranke/courses/2017fall/... · 2017. 10. 31. · Hadoop Ecosystem


CS 378 – Big Data Programming

Lecture 18: Hadoop Ecosystem

CS 378 - Fall 2017, Big Data Programming


Hadoop Ecosystem

• Many other tools have been implemented on
  – Hadoop
  – HDFS (Hadoop Distributed File System)

• We’ll briefly discuss a few
  – HBase
  – ZooKeeper
  – Pig, Impala
  – Hive


Hadoop Ecosystem

[Figure; source: thebigdatablog.weebly.com]


HBase

• Column-oriented database
  – Implemented on top of HDFS
  – Distributed

• Goal is to scale to very large data sets
  – With real-time read/write access


Column-oriented Database

• Table cells are the unit of access
  – Content is uninterpreted array of bytes
  – A cell is versioned (can have multiple versions)

• Table cell is accessed by
  – Row, column, and version (often a timestamp)

• Columns are grouped into families
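
The addressing scheme on this slide (row, column, version) can be sketched as a toy in-memory table. This is only an illustration of the data model, not the HBase client API: the class `ToyTable`, its `put` and `get` methods, and the `"family:qualifier"` column spelling are assumptions made for the example.

```python
import time
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a versioned cell store (hypothetical, not HBase)."""

    def __init__(self, families):
        # Columns are grouped into families; the family set is given up front.
        self.families = set(families)
        # (row, "family:qualifier") -> {version: bytes}
        self.cells = defaultdict(dict)

    def put(self, row, column, value, version=None):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if not isinstance(value, bytes):
            raise TypeError("cell content is an uninterpreted array of bytes")
        if version is None:
            version = time.time_ns()  # a timestamp is a common choice of version
        self.cells[(row, column)][version] = value

    def get(self, row, column, version=None):
        versions = self.cells.get((row, column))
        if not versions:
            return None
        if version is None:
            version = max(versions)  # default to the newest version
        return versions.get(version)


table = ToyTable(families=["info"])
table.put(b"row1", "info:name", b"Ada", version=1)
table.put(b"row1", "info:name", b"Ada L.", version=2)
latest = table.get(b"row1", "info:name")            # newest version
first = table.get(b"row1", "info:name", version=1)  # explicit older version
```

Storing bytes only, and leaving interpretation to the caller, mirrors the "uninterpreted array of bytes" point above.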


Column-oriented Database

• New column family members can be added

• Column family members are stored together
• For best performance, family members should be accessed together

• Rows can be subset into regions
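
One way to picture regions is as contiguous ranges of sorted row keys, with a lookup that finds the range holding a given key. The sketch below assumes that model; `ToyRegionMap`, `region_for`, and the split keys are hypothetical names and values, and nothing here reflects how HBase actually locates regions.

```python
import bisect


class ToyRegionMap:
    """Maps row keys to region indexes using sorted split points (illustrative)."""

    def __init__(self, split_keys):
        # Each split key is the first row key of the next region;
        # region 0 holds every key below the first split key.
        self.split_keys = sorted(split_keys)

    def region_for(self, row_key):
        # bisect_right counts the split keys that are <= row_key,
        # which is exactly the index of the region containing row_key.
        return bisect.bisect_right(self.split_keys, row_key)


regions = ToyRegionMap([b"g", b"p"])
region_of_apple = regions.region_for(b"apple")
region_of_kiwi = regions.region_for(b"kiwi")
region_of_zebra = regions.region_for(b"zebra")
```

Because the ranges are contiguous, rows with nearby keys land in the same region, which is why row key design affects how load spreads across servers.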


HBase

[Figure 13-1 from Hadoop: The Definitive Guide, 3rd Edition]


ZooKeeper

• Messaging and synchronization in a distributed environment
  – Distributed queues, locks
  – Leader election among a group of peers

• High availability (tolerates failures)
• Loosely coupled interactions
  – Rendezvous mechanism
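
Leader election is commonly built on ZooKeeper by having each peer create a sequentially numbered, ephemeral node and treating the lowest number as the leader. The sketch below imitates that idea entirely in memory. It does not talk to ZooKeeper, and `ToyElection` with its `join`, `leave`, and `leader` methods are hypothetical names chosen for the example.

```python
import itertools


class ToyElection:
    """In-memory imitation of a lowest-sequence-number election (not ZooKeeper)."""

    def __init__(self):
        self._counter = itertools.count()
        self._nodes = {}  # sequence number -> peer name

    def join(self, peer):
        # Each peer gets the next sequence number, like a sequential node.
        seq = next(self._counter)
        self._nodes[seq] = peer
        return seq

    def leave(self, seq):
        # Models an ephemeral node disappearing when its owner fails.
        self._nodes.pop(seq, None)

    def leader(self):
        if not self._nodes:
            return None
        # The peer holding the lowest sequence number leads.
        return self._nodes[min(self._nodes)]


election = ToyElection()
a = election.join("peer-a")
b = election.join("peer-b")
c = election.join("peer-c")
first_leader = election.leader()
election.leave(a)  # the current leader fails
next_leader = election.leader()
```

The peers never address each other directly; they only observe shared state, which is one reading of the "loosely coupled interactions" point above.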


Pig

• Higher level data structures and operations
  – Higher level than Java code for map-reduce job

• Language: Pig Latin
  – Operations and transformations on data
  – Pig converts these to map-reduce jobs for you

• Think of it as a query language for data in HDFS
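
To make "Pig converts these to map-reduce jobs" concrete, here is a rough Python sketch of the map, shuffle, and reduce work that a group-then-average operation boils down to. It is not Pig Latin and not Pig's generated code; the function names and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records):
    # Emit (key, value) pairs; here the key is the user and the value an amount.
    for user, amount in records:
        yield user, amount


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Summarize each group; here, the average of its values.
    return {key: sum(values) / len(values) for key, values in groups.items()}


records = [("ann", 10.0), ("bob", 4.0), ("ann", 20.0)]  # hypothetical input
averages = reduce_phase(shuffle(map_phase(records)))
```

In a higher-level language the same request is a grouping step followed by an aggregate, and the map, shuffle, and reduce plumbing is generated rather than written by hand.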


Pig Examples: Summarization


Pig Examples: Binning


Pig Examples: Join


Impala

• Interactive SQL for data in HDFS, HBase

• SQL processing engine
  – Parallel execution
  – Horizontal scaling

• Runs on each data node
  – Direct access to HDFS, HBase (no map-reduce)


Hive

• Data warehouse on top of Hadoop

• SQL for access
  – Hive converts a query into a series of map-reduce steps

• Various Hive clients are available
  – JDBC, ODBC, Thrift, …
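
The phrase "a series of map-reduce steps" can be illustrated with a query that needs two passes: counting rows per group, then ordering the groups by that count. The Python sketch below chains two toy jobs for a hypothetical query along the lines of `SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY n DESC`. The `run_job` helper and the mapper and reducer names are assumptions, and this is not how Hive actually plans or executes queries.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one toy map-reduce step in memory (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # reducers see keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output


# Step 1: GROUP BY word with COUNT(*).
def count_mapper(word):
    yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


# Step 2: ORDER BY n DESC. Keying on the negated count makes the
# sorted key order run from the largest count to the smallest.
def order_mapper(pair):
    word, n = pair
    yield -n, word


def order_reducer(negated_n, words):
    for word in sorted(words):
        yield word, -negated_n


words = ["pig", "hive", "pig", "hbase", "hive", "pig"]  # hypothetical table
counts = run_job(words, count_mapper, count_reducer)
ranked = run_job(counts, order_mapper, order_reducer)
```

The second job consumes the first job's output, which is the sense in which a single query becomes a chain of steps.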
