Hadoop Distributed File System : Behind the Scenes

Upload: nitin-khattar

Post on 24-Jan-2015

Category:

Technology


DESCRIPTION

This presentation walks you through the core concepts of the HDFS architecture: where HDFS fits in Hadoop, what the HDFS architecture is, and what its role is...

TRANSCRIPT

Page 1: Hadoop Distributed File System(HDFS) : Behind the scenes

Hadoop Distributed File System: Behind the Scenes

Page 2: Hadoop Distributed File System(HDFS) : Behind the scenes

WHAT IS

Page 3: Hadoop Distributed File System(HDFS) : Behind the scenes
Page 4: Hadoop Distributed File System(HDFS) : Behind the scenes

WHERE HDFS FITS IN HADOOP?

Page 5: Hadoop Distributed File System(HDFS) : Behind the scenes

LET’S FIRST UNDERSTAND BUZZWORDS IN THE HADOOP WORLD

Page 6: Hadoop Distributed File System(HDFS) : Behind the scenes

REPLICATION

Page 7: Hadoop Distributed File System(HDFS) : Behind the scenes

FAULT TOLERANCE

Page 8: Hadoop Distributed File System(HDFS) : Behind the scenes

LOAD BALANCING

Page 9: Hadoop Distributed File System(HDFS) : Behind the scenes

RELIABILITY

Page 10: Hadoop Distributed File System(HDFS) : Behind the scenes

CLUSTERING

Page 11: Hadoop Distributed File System(HDFS) : Behind the scenes

IT’S TIME FOR DEEP DIVE…

Page 12: Hadoop Distributed File System(HDFS) : Behind the scenes

HDFS ARCHITECTURE

Name Node

Data Node

Task Tracker

Job Tracker

Image and Journal

HDFS Client

Checkpoint Node

Backup Node

Page 13: Hadoop Distributed File System(HDFS) : Behind the scenes

Name Node

Job Tracker Checkpoint

Data Node 1, Data Node 2, ... Data Node N

Task Tracker Task Tracker Task Tracker

Backup Node

Image Journal

HDFS Client

Page 14: Hadoop Distributed File System(HDFS) : Behind the scenes

NAME NODE

Job Tracker

Inode Image

Checkpoint

Journal

Page 15: Hadoop Distributed File System(HDFS) : Behind the scenes

Inode - Files and directories are represented on the NameNode by inodes, which record attributes such as permissions, modification and access times, and namespace and disk space quotas.

Image - The inode data and the list of blocks belonging to each file.

Checkpoint - The persistent record of the image, stored in the local host’s native file system.

Journal - The write-ahead commit log for changes to the file system that must be persistent.
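
How the image, the checkpoint and the journal relate can be illustrated with a small sketch. This is not HDFS code: the class `TinyNameNode`, the JSON file layout and the operation names (`create`, `add_block`, `delete`) are all assumptions made for illustration. It only shows the general idea of logging a change before applying it, and of rebuilding the image on restart from the last checkpoint plus the journal.

```python
import json
import os
import tempfile


class TinyNameNode:
    """Toy namespace: in-memory image, on-disk checkpoint, write-ahead journal."""

    def __init__(self, directory):
        self.checkpoint_path = os.path.join(directory, "checkpoint.json")
        self.journal_path = os.path.join(directory, "journal.log")
        # path -> list of block ids (stands in for inode data + block list)
        self.image = {}
        self._recover()

    def _apply(self, op):
        if op["type"] == "create":
            self.image[op["path"]] = []
        elif op["type"] == "add_block":
            self.image[op["path"]].append(op["block"])
        elif op["type"] == "delete":
            self.image.pop(op["path"], None)

    def _recover(self):
        # Restart: load the last checkpoint, then replay the journal on top.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                self.image = json.load(f)
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def log_and_apply(self, op):
        # Write-ahead: flush and sync the journal record before the
        # in-memory image is modified.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(op)

    def save_checkpoint(self):
        # Persist the current image, then start an empty journal.
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.image, f)
            f.flush()
            os.fsync(f.fileno())
        open(self.journal_path, "w").close()


def demo():
    with tempfile.TemporaryDirectory() as d:
        nn = TinyNameNode(d)
        nn.log_and_apply({"type": "create", "path": "/a.txt"})
        nn.log_and_apply({"type": "add_block", "path": "/a.txt", "block": 1})
        nn.save_checkpoint()
        nn.log_and_apply({"type": "add_block", "path": "/a.txt", "block": 2})
        restarted = TinyNameNode(d)  # simulates a restart
        return restarted.image
```

In `demo()`, the second block is only in the journal when the restart happens, yet the rebuilt image contains both blocks because the journal is replayed over the checkpoint.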

Page 16: Hadoop Distributed File System(HDFS) : Behind the scenes

DATA NODE

On Start Up…

Data Node, Name Node

Page 17: Hadoop Distributed File System(HDFS) : Behind the scenes

DATA NODE

Data Node Name Node

Total Storage Capacity

Fraction of Storage in Use

Number of Data Transfers in Progress

Commands
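
The exchange on this slide, where a Data Node reports its storage figures and receives commands in return, can be sketched as follows. The class and field names (`Heartbeat`, `ToyNameNode`, `queue_command`) and the command strings are hypothetical; timestamps are passed in explicitly so the example stays deterministic, and the timeout value is chosen by the caller rather than taken from any HDFS default.

```python
from dataclasses import dataclass, field


@dataclass
class Heartbeat:
    node_id: str
    total_capacity_bytes: int
    fraction_in_use: float
    transfers_in_progress: int


@dataclass
class ToyNameNode:
    # Commands queued per data node (hypothetical free-form strings).
    pending_commands: dict = field(default_factory=dict)
    last_seen: dict = field(default_factory=dict)
    stats: dict = field(default_factory=dict)

    def queue_command(self, node_id, command):
        self.pending_commands.setdefault(node_id, []).append(command)

    def handle_heartbeat(self, hb, now):
        # Record liveness and the storage statistics carried by the heartbeat.
        self.last_seen[hb.node_id] = now
        self.stats[hb.node_id] = hb
        # Commands travel back to the data node in the heartbeat reply.
        return self.pending_commands.pop(hb.node_id, [])

    def dead_nodes(self, now, timeout_seconds):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(
            n for n, t in self.last_seen.items() if now - t > timeout_seconds
        )
```

The design point is that the Name Node never calls a Data Node directly in this sketch: it only answers heartbeats, so any instruction has to wait for the next one.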

Page 18: Hadoop Distributed File System(HDFS) : Behind the scenes

HDFS CLIENT

Page 19: Hadoop Distributed File System(HDFS) : Behind the scenes

IMAGE & JOURNAL

Flush & Sync Operation

Page 20: Hadoop Distributed File System(HDFS) : Behind the scenes

CHECKPOINT NODE

Page 21: Hadoop Distributed File System(HDFS) : Behind the scenes

BACKUP NODE

Page 22: Hadoop Distributed File System(HDFS) : Behind the scenes

FILE I/O OPERATIONS

Single Writer

Multiple Reader

Page 23: Hadoop Distributed File System(HDFS) : Behind the scenes

DATA WRITE OPERATION

DN1

DN4

DN2

DN3

Client Name Node

client DN1 DN2 DN3

setup

packet1

packet2

packet3

packet4

packet5

close
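
The setup, packet and close sequence in this diagram can be sketched as a chain of nodes that each store a packet and pass it on. This is a deliberately simplified model: `ToyDataNode`, `write_block` and the packet size are assumptions, and each packet is acknowledged before the next one is sent, whereas a real pipeline can keep several packets in flight.

```python
def split_into_packets(data, packet_size):
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.received = []
        self.downstream = None

    def receive(self, packet):
        # Store the packet locally, forward it downstream, and acknowledge
        # only once every node further along the pipeline has acknowledged.
        self.received.append(packet)
        if self.downstream is not None:
            return self.downstream.receive(packet)
        return True


def write_block(data, nodes, packet_size=4):
    # setup: chain the data nodes into a pipeline in the given order
    for upstream, downstream in zip(nodes, nodes[1:]):
        upstream.downstream = downstream
    acked = 0
    for packet in split_into_packets(data, packet_size):
        if nodes[0].receive(packet):
            acked += 1
    # close: tear the pipeline down
    for node in nodes:
        node.downstream = None
    return acked
```

Writing 18 bytes with a packet size of 4 through three nodes produces five packets, and every node ends up holding the full block.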

Page 24: Hadoop Distributed File System(HDFS) : Behind the scenes

DATA WRITE/READ OPERATION

DN1

Client Name Node

Single Writer Multiple Reader Model

Lease Management (Soft Limit and Hard Limit)

Pipelining, Buffering and Hflush

Checksum for data integrity

Choosing nodes for read operation
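
Lease management with a soft and a hard limit can be illustrated with a short sketch. The slide gives no numbers, so the two limit values below are assumptions for illustration, as are the names `Lease` and `try_acquire`. The idea shown: within the soft limit the writer has exclusive access, between the soft and hard limit another client may preempt the lease, and after the hard limit the lease is treated as expired.

```python
SOFT_LIMIT_SECONDS = 60    # assumed value for illustration
HARD_LIMIT_SECONDS = 3600  # assumed value for illustration


class Lease:
    def __init__(self, holder, now):
        self.holder = holder
        self.last_renewed = now

    def renew(self, now):
        self.last_renewed = now

    def state(self, now):
        age = now - self.last_renewed
        if age <= SOFT_LIMIT_SECONDS:
            return "exclusive"    # writer is guaranteed sole access
        if age <= HARD_LIMIT_SECONDS:
            return "preemptible"  # another client may take over the lease
        return "expired"          # file can be closed on the writer's behalf


def try_acquire(lease, client, now):
    """Return the lease that is in force after `client` asks to write."""
    if lease is None or lease.state(now) != "exclusive":
        return Lease(client, now)
    return lease  # the current holder keeps it (single writer)
```

A writer that keeps calling `renew` stays in the "exclusive" state indefinitely; the limits only matter once renewals stop.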

Page 25: Hadoop Distributed File System(HDFS) : Behind the scenes

BLOCK PLACEMENT

RACK1: DN1 DN2 DN3 DN4 DN5

RACK2: DN6 DN7 DN8 DN9 DN10

RACK3: DN11 DN12 DN13 DN14 DN15

/

Name Node: Journal, Inode Image, checkpoint

Client

Add(data)

Data Nodes for Replica
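
The slide does not spell out how the Name Node picks the Data Nodes for the replicas. The sketch below assumes a replication factor of three and a rack-aware rule of the kind usually described for HDFS: the first replica on the writer's own node, the other two on two different nodes of a single other rack. The function name `choose_replica_nodes` and the topology dictionary are hypothetical.

```python
import random


def choose_replica_nodes(writer, topology, rng=random):
    """Pick three data nodes: the writer's node, then two on one other rack.

    topology maps a rack name to the list of data nodes on that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # Choose one remote rack, then two distinct nodes on it.
    remote_rack = rng.choice(sorted(r for r in topology if r != local_rack))
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]
```

Placing two replicas on the same remote rack keeps cross-rack traffic to a single transfer while still surviving the loss of either rack.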

Page 26: Hadoop Distributed File System(HDFS) : Behind the scenes

REPLICATION MANAGEMENT

RACK1: DN1 DN2 DN3 DN4 DN5

RACK2: DN6 DN7 DN8 DN9 DN10

RACK3: DN11 DN12 DN13 DN14 DN15

/

Over Replicated

Under Replicated

Name Node: Journal, Inode Image, checkpoint
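
Detecting over-replicated and under-replicated blocks comes down to comparing each block's live replica count with its target. A minimal sketch, assuming a target of three replicas and a hypothetical `block_locations` mapping from block id to the Data Nodes holding it:

```python
def classify_blocks(block_locations, target_replication=3):
    """Split blocks into under- and over-replicated by live replica count."""
    under, over = [], []
    for block, nodes in sorted(block_locations.items()):
        live = len(set(nodes))
        if live < target_replication:
            under.append((live, block))
        elif live > target_replication:
            over.append(block)
    # Blocks with the fewest replicas are queued for re-replication first.
    under.sort()
    return [block for _, block in under], over
```

Under-replicated blocks would then get new replicas scheduled, and over-replicated blocks would have a replica removed; choosing which node to copy to or delete from is left out of this sketch.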

Page 27: Hadoop Distributed File System(HDFS) : Behind the scenes

BALANCER

Balances disk space utilization across individual data nodes.

Based on a utilization threshold. Utilization balancing follows the block placement policy.
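
The threshold test can be sketched as a comparison between each node's utilization and the utilization of the cluster as a whole. The function name, the input shape and the 10% threshold below are assumptions for illustration; the sketch only identifies which nodes are out of balance and does not move any blocks.

```python
def unbalanced_nodes(nodes, threshold=0.10):
    """nodes maps a node name to (used_bytes, capacity_bytes).

    Returns (cluster_utilization, over_utilized, under_utilized).
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_util = total_used / total_capacity
    over = sorted(
        n for n, (u, c) in nodes.items() if u / c - cluster_util > threshold
    )
    under = sorted(
        n for n, (u, c) in nodes.items() if cluster_util - u / c > threshold
    )
    return cluster_util, over, under
```

A balancer built on this test would move blocks from the over-utilized nodes to the under-utilized ones, subject to the block placement policy, until both lists are empty.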

Page 28: Hadoop Distributed File System(HDFS) : Behind the scenes

SCANNER

The scanner verifies data integrity based on checksums.
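
Checksum-based verification can be sketched with CRC32 from Python's standard library. The chunk size and the function names are assumptions; the point is only that a checksum is stored per chunk at write time and recomputed later to find chunks that no longer match.

```python
import zlib


def checksummed_chunks(data, chunk_size=512):
    """Split data into chunks and compute one CRC32 per chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return chunks, [zlib.crc32(chunk) for chunk in chunks]


def scan_block(chunks, checksums):
    """Return the indexes of chunks whose stored checksum no longer matches."""
    return [
        i
        for i, (chunk, stored) in enumerate(zip(chunks, checksums))
        if zlib.crc32(chunk) != stored
    ]
```

A replica with any mismatching chunk would be reported as corrupt so that a healthy copy can be used instead.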
