20110227 hadoop disk-linuxfb
What does a Hadoop Process do on Your Machine
Wang Xu
Feb, 2011
Outline
...1 Hadoop: a Clone of Google Infrastructure
...2 What’s MapReduce
...3 How HDFS supports MapReduce and Others
...4 What’s DataNode Doing
...5 What’s TaskTracker Doing
Apache Hadoop: History & Dreams
Nutch, Lucene ... Yahoo and search engines ... Doug Cutting ... Yahoo, Cloudera, & Facebook
The Hadoop Family
..
Projects and Their Counterparts at Google.
.. Common: IPC, utils, and other common stuff
.. HDFS ⇐⇒ Google GFS: Distributed File System
.. MapReduce ⇐⇒ Google MapReduce: Framework for Distributed Computing
.. HBase ⇐⇒ BigTable: Column-Family-based Non-Relational Database
.. ZooKeeper ⇐⇒ Chubby: Distributed Lock Service, for Quorum ...
.. Avro ⇐⇒ Protocol Buffers: Cross-language Data Serialization and Exchange
.. Hive & Pig: Data Warehouse on top of the MapReduce Platform
.. Oozie: Data flow engine
How Hadoop Helps Your Business
..
Usages of Hadoop.
.. Search Engine: the Nutch project, Yahoo (now Bing-based), and some others
.. Log Analysis: for user behavior, network signalling, etc.
.. Facebook's new messaging system is based on HBase
.. Advertisement: Yahoo and other companies
.. Hive is used at Facebook
The Nature of MapReduce
..
Map in Functional Programming.
.. Map: map({1,2,3,4}, (×2)) ⇒ {2,4,6,8}
.. Every element is processed with the given function
.. Elements do not affect each other
.. The input is immutable, and the output is a new list
.. Well suited to parallel processing
..
Reduce in Functional Programming.
.. Reduce: reduce({1,2,3,4}, (×)) ⇒ {24}
.. All the elements in the list are processed together
.. The input is immutable, and the output is a new, single result
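The two slide examples can be reproduced with Python's built-in map and functools.reduce. This is only a small sketch of the functional idea, not Hadoop code; the variable names are chosen for illustration.

```python
from functools import reduce
from operator import mul

numbers = (1, 2, 3, 4)          # a tuple, so the input cannot be modified

# Map: apply (x2) to every element independently; the result is a new list.
doubled = list(map(lambda x: x * 2, numbers))

# Reduce: fold all elements together with (x); the result is a single value.
product = reduce(mul, numbers)

print(doubled)
print(product)
```

Because each map call looks at one element only, the calls could run on different machines; the reduce step is where the elements meet.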
Distributed MapReduce
..
A Map Task’s Life.
.. Input: A segment of the input records (from DFS)
.. Job: Process the records one by one, emitting zero, one, or more K-V pairs per record
.. Then: Work as a server, waiting for the Reduces’ K-V retrieval requests
..
A Reduce Task’s Life.
.. Shuffle: Retrieve the K-V pairs for its specific keys from all Map tasks
.. Sort: Group and merge the K-V pairs
.. Reduce: Run reduce() and write the output files back to DFS
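The two life cycles above can be sketched as a single-process word count. This is a toy model under stated assumptions: map_task, reduce_task, run_job, and the byte-sum partitioner are hypothetical names invented for the sketch, and in-memory dicts stand in for DFS and for the map-side server.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key, num_reduces):
    # Hypothetical partitioner: decides which reduce task owns a key.
    return sum(key.encode("utf-8")) % num_reduces


def map_task(records, num_reduces):
    """Process records one by one and emit (word, 1) pairs, grouped per reduce."""
    partitions = defaultdict(list)
    for line in records:
        for word in line.split():        # zero, one, or more pairs per record
            partitions[partition(word, num_reduces)].append((word, 1))
    return partitions


def reduce_task(reduce_id, map_outputs):
    # Shuffle: fetch this reduce's partition from every map task.
    pairs = [kv for out in map_outputs for kv in out.get(reduce_id, [])]
    # Sort: bring equal keys together.
    pairs.sort(key=itemgetter(0))
    # Reduce: sum the values of each key.
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}


def run_job(splits, num_reduces=2):
    map_outputs = [map_task(split, num_reduces) for split in splits]
    results = {}
    for reduce_id in range(num_reduces):
        results.update(reduce_task(reduce_id, map_outputs))
    return results


splits = [["a b a"], ["b c"], ["a"]]     # three input segments, one per map task
counts = run_job(splits)
print(counts)
```

Each key lands in exactly one reduce task, so the per-reduce results can simply be combined at the end.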
The Landscape of MapReduce
Figure: Data Flow of MapReduce (Map 1 to Map 4 feeding Reduce 1 to Reduce 3)
...1 Maps read data from DFS separately
...2 Maps process the data and do not communicate with each other
...3 Maps keep their results in node-local storage (local disk)
...4 Reduces retrieve data from all the Maps
...5 Reduces do not communicate with each other either
...6 Reduces write the results back to DFS
Hadoop Distributed File System
..
Commodity PC based Massive Data Storage System.
.. Redundancy: blocks are replicated to different nodes in different racks
.. Location awareness: tasks can be scheduled on the nodes storing the data
.. Write once, read many times
.. Large files are split into blocks
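The splitting of a large file into blocks is plain arithmetic, sketched below. The function name split_into_blocks is hypothetical, and the 64 MB default is assumed here as one of the block sizes used by the DataNode.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the (offset, length) of every block of a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


blocks = split_into_blocks(200 * MB)
for offset, length in blocks:
    print(offset // MB, length // MB)
```

Only the last block may be shorter than the block size; every block can then be replicated and scheduled independently.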
The Role of a DataNode
..
Block (chunk) container of HDFS.
.. Manages its directories like a software RAID 0: block files are written round-robin
.. Keeps a block-to-directory map in memory
.. DataNodeProtocol (served by the NameNode): communicates with the NameNode to report, send heartbeats, and get commands
.. DataTransferProtocol: communicates with clients and other DataNodes to transfer blocks
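The first two points can be sketched as follows. BlockDirs and its methods are hypothetical names for illustration, not the DataNode's real classes, and the directory paths are made up.

```python
from itertools import cycle


class BlockDirs:
    """Toy model: pick data directories round-robin and remember the choice."""

    def __init__(self, dirs):
        self._next_dir = cycle(dirs)     # endless round-robin over the dirs
        self.block_to_dir = {}           # the in-memory block-to-directory map

    def add_block(self, block_id):
        directory = next(self._next_dir)
        self.block_to_dir[block_id] = directory
        return f"{directory}/blk_{block_id}"

    def locate(self, block_id):
        return self.block_to_dir[block_id]


store = BlockDirs(["/data1", "/data2"])  # hypothetical data directories
paths = [store.add_block(block_id) for block_id in (1, 2, 3)]
print(paths)
```

Round-robin spreads write load over the disks, and the in-memory map avoids scanning every directory to find a block.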
DataNode on Disk
..
Block Files.
.. The blk_XXX files
.. Block size: 64 MB or 128 MB
..
Meta Files.
.. The blk_XXX.meta files
.. Header: layout version and bytes per checksum
.. Checksums
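A meta file of this shape can be sketched as a small header followed by one checksum per chunk of block data. The field widths (2-byte version, 4-byte bytes-per-checksum), the CRC32 choice, and the 512-byte default are assumptions made for this sketch; the slide does not give the exact on-disk layout.

```python
import struct
import zlib

# Assumed header layout for illustration: layout version, bytes per checksum.
HEADER = struct.Struct(">hI")


def build_meta(block_data, layout_version=1, bytes_per_checksum=512):
    """Build header + one CRC32 per bytes_per_checksum chunk of the block."""
    parts = [HEADER.pack(layout_version, bytes_per_checksum)]
    for offset in range(0, len(block_data), bytes_per_checksum):
        chunk = block_data[offset:offset + bytes_per_checksum]
        parts.append(struct.pack(">I", zlib.crc32(chunk)))
    return b"".join(parts)


def verify(block_data, meta):
    """Recompute the checksums using the parameters stored in the header."""
    layout_version, bytes_per_checksum = HEADER.unpack_from(meta, 0)
    return meta == build_meta(block_data, layout_version, bytes_per_checksum)


block = b"a" * 1025                      # spans three 512-byte chunks
meta = build_meta(block)
print(len(meta), verify(block, meta))
```

Keeping checksums per small chunk lets a reader detect corruption in the part it reads without checksumming the whole block.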
Block Writing To DataNode
..
The Pipeline.
.. Set up the pipeline: Client → DataNode1 → DataNode2 → DataNode3
.. Each DataNode receives a packet and forwards it to the next DataNode
.. The DataNode writes the received data buffer
.. The DataNode then writes the corresponding meta data
.. The DataNode flushes the file stream
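The pipeline steps can be sketched with three in-memory nodes. The DataNode class below is a toy stand-in with hypothetical names; real packets, acknowledgements, and disk I/O are left out, and one CRC32 per packet is assumed as the meta data.

```python
import zlib


class DataNode:
    """Toy pipeline stage: forward the packet, then store data and checksum."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream     # next DataNode in the pipeline, if any
        self.block = bytearray()         # stands in for the block file
        self.checksums = []              # stands in for the meta file

    def receive_packet(self, packet):
        if self.downstream is not None:
            self.downstream.receive_packet(packet)   # forward to the next node
        self.block.extend(packet)                    # write the data buffer
        self.checksums.append(zlib.crc32(packet))    # write the meta data
        # A real node would flush its file streams at this point.


# Set up the pipeline: Client -> DataNode1 -> DataNode2 -> DataNode3
dn3 = DataNode("DataNode3")
dn2 = DataNode("DataNode2", downstream=dn3)
dn1 = DataNode("DataNode1", downstream=dn2)

for packet in (b"hello ", b"world"):     # the client only talks to DataNode1
    dn1.receive_packet(packet)

print([bytes(node.block) for node in (dn1, dn2, dn3)])
```

The client sends each packet once; replication happens node to node, so the client's bandwidth is not multiplied by the replica count.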
The Role of a TaskTracker
..
Local Commander of a Node.
.. Runs from beginning to end
.. Gets tasks from the JobTracker, the big boss
.. Both Map and Reduce tasks are run by the TaskTracker
.. Assigns tasks to Mapper and Reducer processes
.. Works as an HTTP server (Jetty) for data transfer between TaskTrackers
Daily Life of a Mapper
..
Direct Mapper Output.
.. Run map() on every record and collect the K-V pairs
.. Write each K-V pair to a file (via the OutputFormat) as soon as it is produced
.. Flush the file.
..
Buffered Mapper (The Normal Case).
.. Run map() on every record and collect the K-V pairs
.. Collect the K-V pairs into a buffer whose size is set by io.sort.mb
.. Spill to an external file when the map output fills the buffer
.. Finally, do an external sort (with an optional Combiner) and write the final file
.. file: $local/taskTracker/jobcache/jobid/taskid/file.out
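The buffered case can be sketched as collect, spill, and merge. BufferedCollector is a hypothetical class for illustration: the buffer limit is counted in records rather than megabytes (standing in for io.sort.mb), and in-memory lists stand in for the spill files.

```python
import heapq
from itertools import groupby
from operator import itemgetter


class BufferedCollector:
    def __init__(self, buffer_limit, combiner=None):
        self.buffer_limit = buffer_limit   # stand-in for io.sort.mb, in records
        self.combiner = combiner           # optional function over a key's values
        self.buffer = []
        self.spills = []                   # each spill is one sorted run

    def collect(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_limit:
            self._spill()

    def _spill(self):
        if self.buffer:
            self.spills.append(sorted(self.buffer, key=itemgetter(0)))
            self.buffer = []

    def close(self):
        """Spill what is left, then merge all sorted runs into the final output."""
        self._spill()
        merged = heapq.merge(*self.spills, key=itemgetter(0))
        if self.combiner is None:
            return list(merged)
        return [(key, self.combiner(v for _, v in group))
                for key, group in groupby(merged, key=itemgetter(0))]


collector = BufferedCollector(buffer_limit=2, combiner=sum)
for word in "a b a c b a".split():
    collector.collect(word, 1)
final_output = collector.close()
print(final_output)
```

Since every spill is already sorted, the final step only has to merge the runs, and the combiner can shrink the output before any reducer fetches it.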
Illustration of Map and Combiner from Yahoo
..
Combiner step inserted into the MapReduce data flow.
Figure: http://developer.yahoo.com/hadoop/tutorial/module4.html
Life of a Reducer
..
Shuffle & Sort.
.. Copy the map results from all Maps
.. Store the map outputs on disk or in memory
.. file: $local/taskTracker/jobcache/jobid/taskid/output/maplocationid.out
.. Sort: Merge the map outputs (much like the merge on the Map side, where the Combiner may run)
..
Reduce.
.. Write the results to HDFS using the OutputFormat
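The merge and reduce steps can be sketched as a k-way merge of already sorted map outputs followed by grouping. reduce_phase and the sample data are hypothetical; the result is returned as a list rather than written to HDFS.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_phase(map_outputs, reduce_fn):
    """Merge sorted map outputs, group by key, and call reduce_fn per key."""
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    return [(key, reduce_fn(key, [value for _, value in group]))
            for key, group in groupby(merged, key=itemgetter(0))]


# One sorted output per Map, as fetched during the shuffle.
fetched = [
    [("a", 2), ("b", 1)],
    [("b", 1), ("c", 1)],
    [("a", 1)],
]
totals = reduce_phase(fetched, lambda key, values: sum(values))
print(totals)
```

The merge never needs all pairs in memory at once, which is why the fetched outputs can stay on disk when they are large.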
Q & A