Day7.HDFS & Architecture
http://www.excelonlineclasses.co.nr/
Excel Online Classes offers the following services:
Online Training
Development
Testing
Job support
Technical Guidance
Job Consultancy
Any needs of the IT sector
HDFS
- Nagarjuna K
HDFS
A distributed file system designed to run on commodity hardware
Provides high-throughput access to application data; suitable for applications with large datasets
Assumptions & Goals
Hardware Failure
Streaming Data Access
Large Datasets
Simple Coherency Model
Moving computation is cheaper than moving data
Assumptions & Goals: Hardware Failure
An HDFS instance consists of many machines, each storing part of the data
With that many machines, the chance that some machine goes down cannot be avoided
Detection of faults and automatic recovery is a core architectural goal of HDFS
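A rough calculation shows why failure has to be treated as the normal case at cluster scale. This is only a sketch: it assumes machines fail independently with the same probability, and both the 0.1% daily failure probability and the helper name `prob_any_failure` are illustrative assumptions, not figures from the slides.

```python
def prob_any_failure(num_machines: int, p_fail: float) -> float:
    """Probability that at least one of num_machines fails,
    assuming independent failures with probability p_fail each."""
    return 1.0 - (1.0 - p_fail) ** num_machines


# Hypothetical per-machine daily failure probability (assumption).
P_FAIL = 0.001

for n in (1, 100, 1000, 5000):
    chance = prob_any_failure(n, P_FAIL)
    print(f"{n:>5} machines: {chance:.1%} chance of at least one failure per day")
```

Under these assumptions, the chance of seeing at least one failure grows quickly with the number of machines, which is why detection and recovery are built into the design rather than handled as exceptions.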
Assumptions & Goals: Streaming Data Access
HDFS is designed for batch processing rather than interactive use.
The emphasis is on data throughput, not on low-latency data access.
Assumptions & Goals: Streaming Data Access
HDFS is built on the idea of a "write once, read many times" pattern
Over time, a dataset is generated and placed in HDFS
Analysis is done on a large part of the data, rather than on the first few records
The time to read the whole dataset matters more than the time to retrieve the first or the last record
Assumptions & Goals: Large Datasets
A typical file ranges from gigabytes to terabytes in size
Assumptions & Goals: Simple Coherency Model
HDFS is built on the idea of a "write once, read many times" pattern
This assumption simplifies data coherency and enables high-throughput access
Assumptions & Goals: Moving Computation or Data?
Computation-intensive programming
Data-intensive programming
Where HDFS doesn't fit
Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications
Where HDFS doesn't fit
Low-latency data access
Lots of small files
▪ High latency time
▪ Each file (say 10 KB in size) takes up a block in HDFS
▪ Compress
▪ All the metadata is stored in memory (on the NameNode)
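The cost of many small files can be estimated with a back-of-the-envelope sketch. It assumes that each file and each block costs roughly 150 bytes of memory on the HDFS master node (the NameNode); that figure is a commonly quoted rule of thumb and is an assumption here, not a number from the slides. The helper name `namenode_memory_bytes` is hypothetical.

```python
# Assumed rule-of-thumb memory cost per namespace entry (file or block).
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate metadata memory: one entry per file plus one entry per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


# Ten million files of 10 KB each, one block per file.
num_files = 10_000_000
metadata_gb = namenode_memory_bytes(num_files) / 1e9
data_gb = num_files * 10 / 1e6  # 10 KB per file, expressed in GB
print(f"about {metadata_gb:.1f} GB of metadata for about {data_gb:.0f} GB of data")
```

The point of the estimate is the ratio: metadata memory scales with the number of files and blocks, not with the amount of data stored.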
Where HDFS doesn't fit
Multiple writers, arbitrary file modifications
▪ A single writer writes a file in HDFS
▪ Appending is possible only at the end of a file
▪ Multiple sources writing into the same file, or writing at an arbitrary offset, is not supported (currently)
Blocks
A disk has a block size: the minimum amount of data that can be read or written, typically 512 bytes
File system blocks are a few multiples of the disk block size, typically a few KB
Blocks
In a classical FS, a single block may contain data of only a single file
This leads to internal fragmentation
Newer file systems solve this problem with block suballocation and tail merging
Blocks
HDFS also has a block size: 64 MB
Unlike a normal FS, a file smaller than 64 MB does not occupy 64 MB of underlying storage
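The difference can be sketched with a small calculation. The 64 MB HDFS block size comes from the slide; the 4 KB block size for the classical file system and the helper names `hdfs_usage` and `classical_fs_usage` are assumptions for illustration, and replication is ignored.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB HDFS block size, from the slide


def hdfs_usage(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (number of blocks, bytes stored) for one file, before replication.

    The last block holds only the remaining bytes, so the stored size
    equals the file size.
    """
    blocks = math.ceil(file_size / block_size)
    return blocks, file_size


def classical_fs_usage(file_size: int, block_size: int = 4096) -> int:
    """Bytes allocated when every block is reserved in full (assumed 4 KB blocks)."""
    return math.ceil(file_size / block_size) * block_size


ten_kb = 10 * 1024
print(hdfs_usage(ten_kb))              # one block entry, only 10 KB stored
print(classical_fs_usage(ten_kb + 1))  # rounded up to whole 4 KB blocks
```

A small file still costs one block entry in the metadata, but the storage it takes on disk is only its actual length.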
Why BIG BLOCK size?
Throughput vs. latency
Time to seek to the start of a block vs. time to read the whole block
Why BIG BLOCK size?
Seek time = 10 ms; transfer rate (throughput) = 100 MB/s
To make the seek time 1% of the transfer time, the block size must be about 100 MB
The default is 64 MB
As the transfer rate increases, the block size can be increased
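The arithmetic behind the 100 MB figure can be checked with a short sketch. It uses the seek time and transfer rate given on the slide; the function names `block_size_for_seek_fraction` and `seek_overhead` are hypothetical helpers introduced for illustration.

```python
def block_size_for_seek_fraction(seek_time_s: float,
                                 transfer_rate_mb_s: float,
                                 seek_fraction: float) -> float:
    """Block size (MB) at which seek time equals seek_fraction of transfer time."""
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_rate_mb_s * transfer_time_s


def seek_overhead(block_size_mb: float,
                  seek_time_s: float,
                  transfer_rate_mb_s: float) -> float:
    """Seek time as a fraction of the time needed to transfer one block."""
    return seek_time_s / (block_size_mb / transfer_rate_mb_s)


# Values from the slide: 10 ms seek, 100 MB/s transfer, 1% target.
size_mb = block_size_for_seek_fraction(0.010, 100, 0.01)
print(f"block size for 1% seek overhead: {size_mb:.0f} MB")

# Overhead at the 64 MB default with the same disk.
print(f"seek overhead at 64 MB: {seek_overhead(64, 0.010, 100):.2%}")
```

With a 10 ms seek, the transfer must take 1 s for the seek to be 1% of it, and 1 s at 100 MB/s is 100 MB. A faster disk moves more data in that second, which is why the block size can grow with the transfer rate.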
hadoop fsck / -files -blocks
Gives information about all the files and blocks in the file system
Replication
▪ under-replicated
▪ over-replicated, etc.
Corrupt? etc.
File Permissions on HDFS
The client's identity is determined by the user name and groups under which it operates
A shared FS shouldn't be used in a hostile environment
Going forward: Kerberos authentication
Hadoop File Systems
HDFS is just one implementation of the Hadoop FileSystem abstraction
org.apache.hadoop.fs.FileSystem represents a file system in Hadoop