Hadoop at a glance
![Page 1: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/1.jpg)
HDFS at a glance
Students: An Du – Tan Tran – Toan Do – Vinh Nguyen
Instructor: Professor Lothar Piepmayer
![Page 2: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/2.jpg)
Agenda
1. Design of HDFS
2.1 HDFS Concepts - Blocks
2.2 HDFS Concepts - Namenode and datanodes
3.1 Dataflow - Anatomy of a file read
3.2 Dataflow - Anatomy of a file write
3.3 Dataflow - Coherency model
4. Parallel copying
5. Demo - Command line
![Page 3: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/3.jpg)
The Design of HDFS
Very large distributed file system
Up to 10K nodes, 1 billion files, 100 PB
Streaming data access
Write once, read many times
Commodity hardware
Files are replicated to handle hardware failure
Detect failures and recover from them
![Page 4: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/4.jpg)
Poor fit for
Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications
![Page 5: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/5.jpg)
HDFS Blocks
Normal filesystem blocks are a few kilobytes
HDFS has a large block size
Default 64 MB, typically 128 MB
Unlike a file system for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage
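The block arithmetic above can be sketched in a few lines of Python. This is only an illustration of how a file splits into blocks, not Hadoop code; the function name `block_layout` is made up for this example, and the 64 MB default is the figure quoted on the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return the sizes in bytes of the blocks a file splits into.

    Hypothetical helper for illustration only. The last block holds
    just the remaining bytes, so a small file does not consume a full
    block of storage.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


MB = 1024 * 1024
small = block_layout(1 * MB)    # one block holding only 1 MB
large = block_layout(150 * MB)  # two full blocks plus a 22 MB tail
```

A 150 MB file therefore needs three blocks, and the storage used equals the file size, not three times 64 MB.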
![Page 6: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/6.jpg)
HDFS Blocks
A file is stored as blocks spread over various nodes in the Hadoop cluster.
HDFS creates several replicas of each data block.
Every data block is replicated to multiple nodes across the cluster.
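To make the effect of replication concrete, here is a toy placement sketch. It assumes a replication factor of 3 and a simple round-robin placement; real HDFS uses a rack-aware policy that this example does not model, and the names `place_replicas` and `after_failure` are hypothetical.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy policy for illustration; not the HDFS placement algorithm.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


def after_failure(placement, failed_node):
    """Return the replicas of each block that survive one node failing."""
    return {
        block_id: [node for node in holders if node != failed_node]
        for block_id, holders in placement.items()
    }


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(4, nodes)
survivors = after_failure(placement, "dn1")
```

Because every block lives on three different nodes, losing any single node still leaves at least two copies of each block.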
![Page 7: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/7.jpg)
Dhruba Borthakur - Design and Evolution of the Apache Hadoop File System HDFS.pdf
HDFS Blocks
![Page 8: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/8.jpg)
Why are blocks in HDFS so large?
Minimize the cost of seeks
=> With a large enough block, the time to transfer a block is dominated by the disk transfer rate rather than by seek time
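A quick worked calculation shows the scale involved. The numbers below (10 ms seek time, 100 MB/s transfer rate, 1% target) are illustrative assumptions rather than measurements from this deck, and the function names are invented for the example.

```python
def seek_ratio(block_size_bytes, seek_time_s, transfer_rate_bytes_per_s):
    """Seek time divided by the time needed to transfer one block."""
    transfer_time_s = block_size_bytes / transfer_rate_bytes_per_s
    return seek_time_s / transfer_time_s


def block_size_for_seek_ratio(seek_time_s, transfer_rate_bytes_per_s, ratio):
    """Block size at which seek time is `ratio` of the transfer time."""
    transfer_time_s = seek_time_s / ratio
    return transfer_rate_bytes_per_s * transfer_time_s


SEEK_TIME_S = 0.010      # assumed 10 ms average seek
TRANSFER_RATE = 100e6    # assumed 100 MB/s sustained transfer

# Block size that keeps seek time at 1% of transfer time.
target = block_size_for_seek_ratio(SEEK_TIME_S, TRANSFER_RATE, 0.01)

# With a 4 KB block, the seek costs far more than the transfer itself.
small_block_ratio = seek_ratio(4096, SEEK_TIME_S, TRANSFER_RATE)
```

Under these assumptions the target comes out at about 100 MB, which is in the same range as the 64 MB and 128 MB block sizes mentioned earlier.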
![Page 9: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/9.jpg)
Benefit of Block abstraction
A file can be larger than any single disk in the network
Simplifies the storage subsystem
Provides fault tolerance and availability
![Page 10: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/10.jpg)
Namenode & Datanodes
![Page 11: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/11.jpg)
Namenode (master)
– manages the filesystem namespace
– maintains the filesystem tree and metadata for all the files and directories in the tree
Datanodes (slaves)
– store data in the local file system
– periodically report back to the namenode with lists of all existing blocks
Clients communicate with both namenode and datanodes.
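The division of labour between namenode and datanodes can be sketched as a toy in-memory model. This is not Hadoop's implementation or API; the class `Namenode` and its methods `add_file`, `receive_block_report` and `locate` are hypothetical names chosen for the example.

```python
class Namenode:
    """Toy namenode: holds metadata only, never file contents."""

    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode names

    def add_file(self, path, block_ids):
        """Record which blocks make up a file."""
        self.namespace[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        """Replace what we know about one datanode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return (block id, datanodes holding it) for each block of a file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, set())))
            for block_id in self.namespace[path]
        ]


nn = Namenode()
nn.add_file("/logs/a.txt", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_1"])
locations = nn.locate("/logs/a.txt")
```

A client would use the result of `locate` to decide which datanodes to contact for the actual data.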
Namenode & Datanodes
![Page 12: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/12.jpg)
Anatomy of a File Read
![Page 13: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/13.jpg)
Anatomy of a File Read
Benefits:
- Avoids a bottleneck at the namenode, because clients read data directly from the datanodes
- Serves many clients concurrently
![Page 14: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/14.jpg)
Writing in HDFS
Namenode, Datanode, Block
![Page 15: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/15.jpg)
![Page 16: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/16.jpg)
Writing in HDFS
Exceptions: a datanode fails during the write
The pipeline is closed, and the failed node's partial block and address are removed
The namenode arranges for a new datanode to hold the missing replica
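The failure handling described above can be walked through with a small simulation. It is a simplified sketch under stated assumptions: packets are plain strings, the failure is injected at a chosen packet index, and the helper names `write_block` and `arrange_new_replicas` are hypothetical rather than Hadoop APIs.

```python
def write_block(pipeline, packets, failing_node=None, fail_before_packet=0):
    """Send packets to every live node; optionally fail one node mid-write.

    Returns the surviving pipeline and the packets stored on each node.
    """
    live = list(pipeline)
    stored = {node: [] for node in pipeline}
    for index, packet in enumerate(packets):
        if failing_node in live and index == fail_before_packet:
            live.remove(failing_node)  # drop the failed node's address
            stored.pop(failing_node)   # its partial block is discarded
        for node in live:
            stored[node].append(packet)
    return live, stored


def arrange_new_replicas(all_nodes, live, failed, replication):
    """Pick healthy nodes outside the pipeline to restore the replica count."""
    missing = replication - len(live)
    candidates = [n for n in all_nodes if n not in live and n not in failed]
    return candidates[:missing]


packets = ["p0", "p1", "p2", "p3"]
live, stored = write_block(["dn1", "dn2", "dn3"], packets,
                           failing_node="dn2", fail_before_packet=2)
replacements = arrange_new_replicas(["dn1", "dn2", "dn3", "dn4"],
                                    live, {"dn2"}, replication=3)
```

The write completes on the two healthy nodes, and the replica count is restored afterwards on a node that was not part of the original pipeline.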
![Page 17: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/17.jpg)
Coherency Model
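Data is not guaranteed to be visible to readers while it is still being written
Use sync() to force the data written so far to become visible
Applications should call sync() at suitable points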
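The visibility rule can be illustrated with a toy stream. This sketch deliberately simplifies: it treats all written data as invisible until `sync()` or `close()` is called, and the class `ToyHdfsStream` is a made-up stand-in, not the Hadoop output stream.

```python
class ToyHdfsStream:
    """Toy model: readers see only data up to the last sync or close."""

    def __init__(self):
        self._written = bytearray()
        self._visible_length = 0

    def write(self, data):
        """Buffer data; it is not yet visible to readers."""
        self._written.extend(data)

    def sync(self):
        """Make everything written so far visible to readers."""
        self._visible_length = len(self._written)

    def close(self):
        """Closing the stream performs an implicit sync."""
        self.sync()

    def read_visible(self):
        """Return what a new reader would see right now."""
        return bytes(self._written[:self._visible_length])


stream = ToyHdfsStream()
stream.write(b"first ")
before_sync = stream.read_visible()   # nothing visible yet
stream.sync()
after_sync = stream.read_visible()    # the first write is now visible
stream.write(b"second")
stream.close()
after_close = stream.read_visible()   # everything is visible
```

The trade-off for an application is between durability and throughput: syncing more often exposes less data to loss on failure but adds overhead to each write.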
![Page 18: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/18.jpg)
Parallel copying in HDFS
Transfer data between clusters:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
Implemented as a MapReduce job; each file is copied by a single map
Each map copies at least 256 MB
The default maximum is 20 maps per node
Between clusters running different HDFS versions, the hdfs protocol does not work; use an HTTP-based protocol such as webhdfs:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
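The two sizing rules (at least 256 MB per map, at most 20 maps per node) combine as in the following sketch. It is an approximation of the rules stated on the slide, not distcp's actual code, and the function name `distcp_maps` is hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB


def distcp_maps(total_bytes, num_nodes,
                min_bytes_per_map=256 * MB, max_maps_per_node=20):
    """Estimate how many maps a copy would use under the two rules."""
    # Give each map at least min_bytes_per_map, but always use one map.
    wanted = max(1, total_bytes // min_bytes_per_map)
    # Never exceed the per-node cap across the cluster.
    return min(wanted, num_nodes * max_maps_per_node)


small_copy = distcp_maps(1 * GB, num_nodes=100)      # limited by data size
large_copy = distcp_maps(1000 * GB, num_nodes=100)   # limited by the node cap
average_mb_per_map = 1000 * GB / large_copy / MB
```

Copying 1 GB uses 4 maps, while copying 1000 GB to a 100-node cluster hits the cap of 2000 maps, so each map copies 512 MB on average.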
![Page 19: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/19.jpg)
Setup
Cluster with 3 nodes: 4 GB RAM, 2 CPUs @ 2.0 GHz+, 100 GB HDD
Running under VMware on 3 different servers
Network: 100 Mbps
Operating system: Ubuntu 11.04
Windows: Not tested
![Page 20: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/20.jpg)
Setup Guide - Single Node
Java runtime
ssh
http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
/etc/hadoop/core-site.xml
/etc/hadoop/hdfs-site.xml
![Page 21: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/21.jpg)
Cluster
/etc/hadoop/masters
/etc/hadoop/slaves
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html
![Page 22: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/22.jpg)
Command Line
Similar to *nix:
hadoop fs -ls /
hadoop fs -mkdir /test
hadoop fs -rmr /test
hadoop fs -cp /1 /2
hadoop fs -copyFromLocal /3 hdfs://localhost/
Namenode-specific:
hadoop namenode -format
start-all.sh
![Page 23: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/23.jpg)
Command Line
Sorting: standard method to test a cluster
TeraGen: generate dummy data
TeraSort: sort
TeraValidate: validate the sort result
Command line:
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
![Page 24: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/24.jpg)
Benchmark Result
2 nodes, 1 GB data: 0:03:38
3 nodes, 1 GB data: 0:03:13
2 nodes, 10 GB data: 0:38:07
3 nodes, 10 GB data: 0:31:28
The virtual machines' hard disks are the bottleneck
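The timings above can be turned into speedup figures with a short calculation. The helper names are invented for this example; the only inputs are the four results reported on the slide.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(time_two_nodes, time_three_nodes):
    """How many times faster the 3-node run is than the 2-node run."""
    return to_seconds(time_two_nodes) / to_seconds(time_three_nodes)


speedup_1gb = speedup("0:03:38", "0:03:13")
speedup_10gb = speedup("0:38:07", "0:31:28")
ideal = 3 / 2  # perfect scaling when going from 2 to 3 nodes
```

The measured speedups are roughly 1.13x for 1 GB and 1.21x for 10 GB, well short of the ideal 1.5x, which is consistent with the disks rather than the node count limiting throughput.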
![Page 25: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/25.jpg)
Who wins…?
![Page 26: Hadoop at a glance](https://reader033.vdocuments.us/reader033/viewer/2022061221/54be2f0d4a7959ee018b46df/html5/thumbnails/26.jpg)
References
Hadoop: The Definitive Guide