![Page 1: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/1.jpg)
O’Reilly – Hadoop: The Definitive Guide
Ch.3 The Hadoop Distributed Filesystem
June 4th, 2010Taewhi Lee
![Page 2: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/2.jpg)
2
Outline
HDFS Design & Concepts Data Flow The Command-Line Interface Hadoop Filesystems The Java Interface
![Page 3: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/3.jpg)
3
Design Criteria of HDFS
Node failure handling
Commodity hardware
Write-once, read-many-times High throughput
Streaming data access
Petabyte-scale data
Very large files Growing of filesystem meta-data
Lots of small files
Multiple writers, arbitrary file update Low latency
Random data access
![Page 4: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/4.jpg)
4
HDFS Blocks
Block: the minimum unit of data to read/write/replicate
Large block size: 64MB by default, 128MB in practice– Small metadata volumes, low seek time
A small file does not occupy a full block– HDFS runs on top of underlying filesystem
Filesystem check (fsck)% hadoop fsck –files -blocks
![Page 5: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/5.jpg)
5
Namenode (master)
Single point of failure– Backup the persistent metadata files
– Run a secondary namenode (standby)
Task Metadata Managing filesystem tree & namespace
Namespace image, edit log (stored persistently)
Keeping track of all the blocks
Block locations (stored just in mem-ory)
![Page 6: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/6.jpg)
6
Datanodes (workers)
Store & retrieve blocks
Report the lists of storing blocks to the na-menode
![Page 7: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/7.jpg)
7
Outline
HDFS Design & Concepts Data Flow The Command-Line Interface Hadoop Filesystems The Java Interface
![Page 8: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/8.jpg)
8
Anatomy of a File Read
(according to network topology)
(first block) (next block)
![Page 9: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/9.jpg)
9
Anatomy of a File Read (cont’d)
Error handling– Error in client-datanode communication
Try next closest datanode for the block
Remember failed datanode for later blocks
– Block checksum error Report to the namenode
Client contacts datanodes directly to retrieve data
Key point
![Page 10: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/10.jpg)
10
Anatomy of a File Write
(file exist & permission check)
enqueue packetto data queue
move the packetto ack queue
dequeue the packetfrom ack queue
chose by replica placement strategy
![Page 11: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/11.jpg)
11
Anatomy of a File Write (cont’d)
Error handling– Datanode error while data is being written
Client
– Adds any packets in the ack queue to data queue
– Removes the failed datanode from the pipeline
– Writes the remainder of the data
Namenode
– Arranges under-replicated blocks for further repli-cas
Failed datanode
– Deletes the partial block when the node recovers later on
![Page 12: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/12.jpg)
12
Coherency Model
HDFS provides sync() method– Forcing all buffers to be synchronized to the
datanodes
– Applications should call sync() at suitable points Trade-off between data robustness and throughput
The current block being written that is not guar-anteed to be visible (even if the stream is flushed)
Key point
![Page 13: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/13.jpg)
13
Outline
HDFS Design & Concepts Data Flow The Command-Line Interface Hadoop Filesystems The Java Interface
![Page 14: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/14.jpg)
14
HDFS Configuration
Pseudo-distributed configuration– fs.default.name = hdfs://localhost/
– dfs.replication = 1
See Appendix A for more details
![Page 15: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/15.jpg)
15
Basic Filesystem Operations Copying a file from the local filesystem to HDFS
Copying the file back to the local filesystem
Creating a directory
% hadoop fs –copyFromLocal input/docs/quangle.txt hdfs://lo-calhost/user/tom/quangle.txt
% hadoop fs –copyToLocal hdfs://localhost/user/tom/quan-gle.txt quangle.copy.txt
% hadoop fs –mkdir books% hadoop fs -ls .Found 2 itemsdrwxr-xr-x - tom supergroup 0 2009-04-02 22:41 /user/tom/books-rw-r--r-- 1 tom supergroup 118 2009-04-02 22:29 /user/tom/quangle.txt
![Page 16: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/16.jpg)
16
Outline
HDFS Design & Concepts Data Flow The Command-Line Interface Hadoop Filesystems The Java Interface
![Page 17: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/17.jpg)
17
Hadoop Filesystems
Filesystem URI scheme
Java implementation(all under org.apache.hadoop)
Local file fs.LocalFileSystem
HDFS hdfs hdfs.DistributedFileSystem
HFTP hftp hdfs.HftpFileSystem
HSFTP hsftp hdfs.HsftpFileSystem
HAR har fs.HarFileSystem
KFS (Cloud-Store) kfs fs.kfs.KosmosFileSystem
FTP ftp fs.ftp.FTPFileSystem
S3 (native) s3n fs.s3native.NativeS3FileSystem
S3 (block-based) s3 fs.s3.S3FileSystem
% hadoop fs –ls file:///
Listing the files in the root directory of the local filesys-tem
![Page 18: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/18.jpg)
18
Hadoop Filesystems (cont’d)
HDFS is just one implementation of Hadoop filesystem
You can run MapReduce programs on any of these filesystems− But, DFS (e.g., HDFS, KFS) is better to process
large volumes of data
![Page 19: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/19.jpg)
19
Interfaces
Thrift– SW framework for scalable cross-language services develop-
ment
– It combines a software stack with a code generation engine to build services seamlessly between C++, Perl, PHP, Python, Ruby, …
C API– Uses JNI(Java Native Interface) to call a Java filesystem client
FUSE (Filesystem in USErspace)– Allows any Hadoop filesystem to be mounted as a standard filesys-
tem
WebDAV– Allows HDFS to be mounted as a standard filesystem over WebDAV
HTTP, FTP
![Page 20: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/20.jpg)
20
Outline
HDFS Design & Concepts Data Flow The Command-Line Interface Hadoop Filesystems The Java Interface
![Page 21: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/21.jpg)
21
Reading Data from a Hadoop URL
Using java.net.URL object
Displaying a file like UNIX cat commandThis method can only be called just once per JVM
![Page 22: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/22.jpg)
22
Reading Data from a Hadoop URL (cont’d) Sample Run
![Page 23: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/23.jpg)
23
Reading Data Using the FileSystem API HDFS file : Hadoop Path object : HDFS URI
![Page 24: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/24.jpg)
24
FSDataInputStream
A specialization of java.io.DataInputStream
![Page 25: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/25.jpg)
25
FSDataInputStream (cont’d)
Preserving the current offset in the file Thread-safe
![Page 26: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/26.jpg)
26
Writing Data
Create or append with Path object
Copying a local file to a HDFS, and shows progress
![Page 27: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/27.jpg)
27
FSDataOutputStream
Seeking is not permitted HDFS allows only sequential writes or appends
![Page 28: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/28.jpg)
28
Directories
Creating a directory
Often, you don’t need to explicitly create a di-rectory Writing a file will automatically creates any parent
directories
![Page 29: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/29.jpg)
29
File Metadata: FileStatus
![Page 30: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/30.jpg)
30
Listing Files
![Page 31: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/31.jpg)
31
File Patterns
Globbing: to use wildcard characters to match mutiple files
Glob characters and their meanings
![Page 32: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/32.jpg)
32
PathFilter
Allows programmatic control over matching
PathFilter for excluding paths that match a regex
![Page 33: O’Reilly – Hadoop: The Definitive Guide Ch.3 The Hadoop Distributed Filesystem June 4 th, 2010 Taewhi Lee](https://reader038.vdocuments.us/reader038/viewer/2022102818/56649cff5503460f949d16d1/html5/thumbnails/33.jpg)
33
Deleting Data
Removing files or directories permanently