The Family of Hadoop
Nham Xuan Nam
nhamxuannam [at] gmail.com
http://namnham.blogspot.com
Barcamp Saigon, December 13 2009
Content
History
Sub-projects
HDFS
MapReduce
HBase
Hive
History
Hadoop was created by Doug Cutting, the creator of Lucene.
Lucene: open source index & search library.
Nutch: Lucene-based web crawler.
Jun 2003, a successful 100-million-page Nutch demo system was running.
Nutch's problem: its architecture could not scale to the billions of pages on the web.
History
Oct 2003, Google published the paper “The Google File System”.
In 2004, the Nutch team wrote an open source implementation of GFS, called the Nutch Distributed File System (NDFS).
Dec 2004, Google published the paper “MapReduce: Simplified Data Processing on Large Clusters”.
In 2005, the Nutch team implemented MapReduce in Nutch.
By mid-2005, all the major Nutch algorithms had been ported to run on MapReduce and NDFS.
History
Feb 2006, Nutch's NDFS and MapReduce implementations were moved out into an independent project named Hadoop.
Doug Cutting joined Yahoo!.
Jan 2008, Hadoop became an Apache top-level project.
Feb 2008, Yahoo!'s production search index was being generated by a 10,000-core Hadoop cluster.
History
Source: http://wiki.apache.org/hadoop/PoweredBy
Sub-projects
Architecture
Data Model
Files are stored as blocks (default size: 64 MB)
Reliability through replication
– Each block is replicated to several datanodes
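As a rough illustration of this layout, the sketch below (plain Python, not HDFS code) splits a file into fixed-size blocks and counts the replicas the cluster stores; the 64 MB block size and a replication factor of 3 are the common defaults of that era:

```python
# Sketch: how a file is split into fixed-size blocks and how many
# replicas the cluster stores in total. Illustrative only, not HDFS code.

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size at the time: 64 MB
REPLICATION = 3                # common default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 150 MB file occupies three blocks: 64 MB, 64 MB, and a 22 MB tail.
sizes = split_into_blocks(150 * 1024 * 1024)
print(len(sizes), sizes[-1] // (1024 * 1024))  # 3 blocks, 22 MB tail
print(len(sizes) * REPLICATION)                # 9 block replicas in total
```

Note that the last block only occupies as much data as the file actually has, which is why small files do not waste a full block of disk.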
Namenode & Datanodes
Namenode (master)
– manages the filesystem namespace
– maintains the filesystem tree and the metadata for all files and directories in the tree
Datanodes (slaves)
– store data in the local file system
– periodically report back to the namenode with lists of all blocks they store
Clients communicate with both the namenode and the datanodes.
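The block reports can be pictured with a toy model (invented names, not the HDFS implementation): each datanode reports the blocks it holds, and the namenode aggregates them into a block-to-locations map it can answer client lookups from.

```python
# Toy model of block reports: each datanode periodically reports the list of
# blocks it stores, and the namenode aggregates them into a
# block -> locations map. Invented names; not the HDFS implementation.
from collections import defaultdict

class ToyNamenode:
    def __init__(self):
        self.block_locations = defaultdict(set)  # block id -> set of datanodes

    def receive_block_report(self, datanode, blocks):
        for block in blocks:
            self.block_locations[block].add(datanode)

    def locate(self, block):
        return sorted(self.block_locations[block])

nn = ToyNamenode()
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_2", "blk_3"])
print(nn.locate("blk_2"))  # ['dn1', 'dn2'] -- blk_2 is replicated on two nodes
```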
Data Flow
Accessibility
FileSystem Java API
– org.apache.hadoop.fs.*
Web Interface
Commands for HDFS users
$ hadoop dfs -mkdir /barcamp
$ hadoop dfs -ls /barcamp
Commands for HDFS admins
$ hadoop dfsadmin -report
$ hadoop dfsadmin -refreshNodes
Programming Model
Data is a stream of keys and values
Map
– Input: <key1, value1> pairs from the data source
– Output: intermediate <key2, value2> pairs
Reduce
– Called once per key, in sorted order
– Input: <key2, list of value2>
– Output: <key3, value3> pairs
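The model above can be simulated in a few lines of plain Python (this is the concept only, not the Hadoop API): map emits intermediate pairs, the framework groups them by key in sorted order, and reduce is called once per key.

```python
# Minimal simulation of the MapReduce programming model: map emits
# intermediate <key2, value2> pairs, the framework sorts and groups them
# into <key2, [value2, ...]>, and reduce is called once per key.
from itertools import groupby

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(map_fn(key1, value1))    # <key2, value2> pairs
    intermediate.sort(key=lambda kv: kv[0])          # shuffle & sort
    results = []
    for key2, group in groupby(intermediate, key=lambda kv: kv[0]):
        results.append(reduce_fn(key2, [v for _, v in group]))
    return results

# Word count expressed in this model:
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

records = [(None, "Hello Barcamp Hello Everyone"),
           (None, "Hello Hadoop Hello Everyone")]
print(run_mapreduce(records, wc_map, wc_reduce))
# [('Barcamp', 1), ('Everyone', 2), ('Hadoop', 1), ('Hello', 4)]
```

Running it on the two Barcamp lines reproduces the totals of the WordCount example that follows.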
Data Flow
WordCount Example
File01: Hello Barcamp Hello Everyone
File02: Hello Hadoop Hello Everyone
Map input: <_, Hello Barcamp Hello Everyone> and <_, Hello Hadoop Hello Everyone>
Map output (per file, after local combining): <Hello, 2> <Barcamp, 1> <Everyone, 1> and <Hello, 2> <Hadoop, 1> <Everyone, 1>
Shuffle & sort: <Barcamp, [1]> <Everyone, [1,1]> <Hadoop, [1]> <Hello, [2,2]>
Reduce output: <Barcamp, 1> <Everyone, 2> <Hadoop, 1> <Hello, 4>
MapReduce in Hadoop
JobTracker (master)
– handles all submitted jobs
– schedules tasks on the slaves
– monitors & re-executes failed tasks
TaskTrackers (slaves)
– execute the tasks
Task
– runs an individual map or reduce
Introduction
Nov 2006, Google published the paper “Bigtable: A Distributed Storage System for Structured Data”.
Bigtable: a distributed, column-oriented store built on top of the Google File System.
HBase: an open source implementation of Bigtable, built on top of HDFS.
Data Model
Data is stored in tables of rows and columns.
Cells are “versioned”
→ data is addressed by a row/column/version key.
Table rows are sorted by row key, the table's primary key.
Columns are grouped into column families.
→ a column name has the form “<family>:<label>”
Tables are stored in regions.
Region: a row range [start-key, end-key)
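The addressing scheme can be sketched as a map from (row, column, version) to value, where a read without an explicit version returns the newest cell. The names below are invented for illustration; this is not the HBase client API.

```python
# Toy sketch of the HBase data model: a table maps
# (row key, "family:label", version) -> value, and a read without an
# explicit version returns the newest cell. Not the HBase client API.

class ToyTable:
    def __init__(self):
        self.cells = {}  # (row, column, version) -> value

    def put(self, row, column, version, value):
        self.cells[(row, column, version)] = value

    def get(self, row, column, version=None):
        if version is not None:
            return self.cells.get((row, column, version))
        # No version given: return the value with the highest version number.
        matches = [(v, val) for (r, c, v), val in self.cells.items()
                   if r == row and c == column]
        return max(matches)[1] if matches else None

t = ToyTable()
t.put("com.example/www", "contents:html", 1, "<html>v1</html>")
t.put("com.example/www", "contents:html", 2, "<html>v2</html>")
print(t.get("com.example/www", "contents:html"))     # newest version: v2
print(t.get("com.example/www", "contents:html", 1))  # explicit version: v1
```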
Architecture
Master Server
– assigns regions to regionservers
– monitors the health of regionservers
– handles administrative functions
RegionServers
– contain regions and handle client read/write requests
Catalog Tables (ROOT and META)
– maintain the current list, state, recent history, and location of all regions
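What the catalog tables enable can be sketched as a lookup over sorted [start-key, end-key) row ranges: given a row key, find the region (and so the regionserver) that serves it. The region boundaries and server names below are invented; this is not the real META-table lookup.

```python
# Sketch: locate the region serving a row key by binary search over
# region start keys. Regions are sorted, half-open [start, end) ranges;
# "" means "from the beginning", None means "to the end". Invented data.
import bisect

regions = [
    ("",  "g",  "regionserver1"),
    ("g", "p",  "regionserver2"),
    ("p", None, "regionserver3"),
]

def find_region(row_key):
    starts = [r[0] for r in regions]
    idx = bisect.bisect_right(starts, row_key) - 1
    start, end, server = regions[idx]
    assert row_key >= start and (end is None or row_key < end)
    return server

print(find_region("apple"))  # regionserver1
print(find_region("mango"))  # regionserver2
print(find_region("zebra"))  # regionserver3
```

Because the ranges are half-open, a row key equal to a region's start key belongs to that region, not the previous one.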
Accessibility
Client API
– org.apache.hadoop.hbase.client.*
HBase Shell
$ bin/hbase shell
hbase>
Web Interface
Introduction
Started at Facebook.
An open source data warehousing solution built on top of Hadoop, for managing and querying structured data.
Hive QL: an SQL-like query language
– compiled into MapReduce jobs
Typical uses: log processing, data mining, ...
Data Model
Tables
– analogous to tables in an RDBMS
– rows are organized into typed columns
– all of a table's data is stored in a directory in HDFS
Partitions
– determine the distribution of data within sub-directories of the table directory
Buckets
– based on the hash of a column in the table
– each bucket is stored as a file in the partition directory
Architecture
Metastore
– contains metadata about the data stored in Hive
– stored in any SQL backend or an embedded Derby database
– Database: a namespace for tables
– Table metadata: column types, physical layout, ...
– Partition metadata
Compiler
Execution Engine
Shell
Hive Query Language
Data Definition (DDL) statements
– CREATE/DROP/ALTER TABLE
– SHOW TABLES/PARTITIONS
Data Manipulation (DML) statements
– LOAD DATA
– INSERT
– SELECT
User Defined Functions: UDF/UDAF
Hive @ Facebook
The End
Thank you!