Day7.HDFS & Architecture

Post on 17-Jul-2016

http://www.excelonlineclasses.co.nr/

excel.onlineclasses@gmail.com

Excel Online Classes offers the following services: online training, development, testing, job support, technical guidance, job consultancy, and any other needs of the IT sector.

HDFS

- Nagarjuna K

HDFS

▪ Distributed file system designed to run on commodity hardware
▪ Provides high-throughput access to application data; suitable for applications with large datasets

Assumptions & Goals

▪ Hardware failure
▪ Streaming data access
▪ Large datasets
▪ Simple coherency model
▪ Moving computation is cheaper than moving data

Assumptions & Goals: Hardware Failure

▪ An HDFS instance consists of many machines, each storing part of the data
▪ The chance that some machine goes down cannot be avoided
▪ Fault detection and automatic recovery are a core architectural goal of HDFS

Assumptions & Goals: Streaming Data Access

▪ HDFS is designed for batch processing rather than interactive use
▪ The emphasis is on data throughput, not on low-latency data access

Assumptions & Goals: Streaming Data Access

▪ HDFS is built on the idea of a "write once, read many times" pattern
▪ Over time, a data set is generated and placed in HDFS
▪ Analysis is done on a large part of the data rather than on the first few records
▪ The time to read the whole data set matters more than the time to retrieve the first or the last record

Assumptions & Goals: Large Datasets

▪ A typical file ranges from gigabytes to terabytes in size

Assumptions & Goals: Simple Coherency Model

▪ HDFS is built on the idea of a "write once, read many times" pattern
▪ This assumption enables high-throughput data access

Assumptions & Goals: Moving Computation or Data?

▪ Computation-intensive programming
▪ Data-intensive programming

Where HDFS doesn’t fit

▪ Low-latency data access
▪ Lots of small files
▪ Multiple writers, arbitrary file modifications

Where HDFS doesn’t fit

▪ Low-latency data access
▪ Lots of small files
   ▪ High latency time
   ▪ Each file (say 10 KB in size) takes up a block in HDFS
   ▪ Compress
   ▪ All the metadata is stored in memory (on the NameNode)
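To make the small-files point concrete, here is a rough sketch in Python. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file entry and per block entry; that figure and the helper name `estimate_namenode_bytes` are illustrative assumptions, not something stated in the slides.

```python
# Rule of thumb (an assumption, not from the slides): each file entry and
# each block entry costs roughly 150 bytes of NameNode memory.
BYTES_PER_OBJECT = 150


def estimate_namenode_bytes(num_files, blocks_per_file=1):
    """Estimate the NameNode memory used by the metadata of num_files files."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    # About 100 GB stored as 10 million files of 10 KB: one block entry per file.
    small = estimate_namenode_bytes(10_000_000)
    # About the same data stored as 100 files of 1 GB: 16 blocks of 64 MB each.
    large = estimate_namenode_bytes(100, blocks_per_file=16)
    print("small files:", small, "bytes of metadata")
    print("large files:", large, "bytes of metadata")
```

Under these assumptions the same amount of data needs gigabytes of metadata memory when stored as tiny files, but only a few hundred kilobytes when stored as large files.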

Where HDFS doesn’t fit

▪ Multiple writers, arbitrary file modifications
   ▪ A single writer writes a file in HDFS
   ▪ Data can be appended only at the end of a file
   ▪ Multiple writers to the same file, or writes at an arbitrary offset, are not supported (currently)

Blocks

▪ A disk has a block size: the minimum amount of data that can be read or written
   ▪ Typically 512 bytes
▪ File system blocks are a small multiple of the disk block size
   ▪ Typically a few KB

Blocks

▪ In a classical FS, a single block may contain data of only a single file
   ▪ This leads to internal fragmentation
▪ Newer file systems solve this problem with block suballocation and tail merging

Blocks

▪ HDFS also has a block size: 64 MB
▪ Unlike in a normal FS, a file smaller than 64 MB does not occupy 64 MB of underlying storage
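The block behaviour above can be sketched in Python. This is only an illustration of the arithmetic, assuming the 64 MB block size quoted in the slides; the helper name `split_into_blocks` is hypothetical and is not part of any Hadoop API.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted in the slides


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block used by a file of file_size bytes."""
    full, tail = divmod(file_size, block_size)
    sizes = [block_size] * full
    if tail:
        sizes.append(tail)  # the last block stores only the remaining bytes
    return sizes


if __name__ == "__main__":
    mb = 1024 * 1024
    for size in (10 * 1024, 200 * mb):
        blocks = split_into_blocks(size)
        print(size, "bytes ->", len(blocks), "block(s), last block holds", blocks[-1], "bytes")
```

A 10 KB file uses one block entry but only 10 KB of storage, while a 200 MB file uses three full blocks plus one 8 MB block.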

Why a big block size?

▪ Throughput vs. latency
   ▪ Time to seek to the start of a block
   ▪ Time to read the whole block

Why a big block size?

▪ Seek time = 10 ms
▪ Transfer rate (throughput) = 100 MB/s
▪ To make the seek time 1% of the transfer time, the block size should be about 100 MB
▪ The default is 64 MB
▪ As transfer rates increase, the block size can be increased
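The 100 MB figure follows from simple arithmetic, sketched here in Python with the slide's numbers. The function name `block_size_mb` and its parameters are hypothetical and chosen only for this illustration.

```python
def block_size_mb(seek_time_s, transfer_rate_mb_s, seek_fraction=0.01):
    """Block size in MB for which seek time is seek_fraction of transfer time."""
    # If the seek should cost 1% of the transfer, the transfer must last 100x the seek.
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_time_s * transfer_rate_mb_s


if __name__ == "__main__":
    # Slide values: 10 ms seek time, 100 MB/s transfer rate.
    print(round(block_size_mb(0.010, 100)), "MB")
```

With a 10 ms seek, the transfer has to last one second, which at 100 MB/s is about 100 MB of data. Doubling the transfer rate doubles the block size needed to keep the same ratio.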

hadoop fsck / -files -blocks

▪ Gives information about all the files and blocks in the file system
▪ Replication: under-replicated, over-replicated, etc.
▪ Corrupt blocks, etc.
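If the check needs to be scripted, the command from the slide can be wrapped as in this minimal Python sketch. The helper name `build_fsck_command` is hypothetical; the only thing taken from the slide is the command line itself, and the script runs it only when a `hadoop` executable is found on the PATH.

```python
import shutil
import subprocess


def build_fsck_command(path="/", files=True, blocks=True):
    """Build the argument list for `hadoop fsck` (hypothetical helper)."""
    cmd = ["hadoop", "fsck", path]
    if files:
        cmd.append("-files")
        if blocks:  # -blocks is only added together with -files
            cmd.append("-blocks")
    return cmd


if __name__ == "__main__":
    cmd = build_fsck_command()
    if shutil.which("hadoop"):
        # Only runs on a machine with a configured Hadoop client.
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout)
    else:
        print("hadoop not found; would run:", " ".join(cmd))
```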

File Permissions on HDFS

▪ The client’s identity is determined by the user name and groups of the process from which it operates
▪ A file system shared this way shouldn’t be used in a hostile environment
▪ Going forward: Kerberos authentication

Hadoop File Systems

▪ HDFS is just one implementation of the Hadoop FileSystem abstraction
▪ org.apache.hadoop.fs.FileSystem represents a file system in Hadoop
