hadoop training in hyderabad | hadoop file system
DESCRIPTION
Hadoop Institutes : kelly technologies is the best Hadoop Training Institutes in Hyderabad. Providing Hadoop training by real time faculty in Hyderabad.www.kellytechno.com/Hyderabad/Course/Hadoop-TrainingTRANSCRIPT
7/17/2019 Hadoop Training in Hyderabad | Hadoop File System
http://slidepdf.com/reader/full/hadoop-training-in-hyderabad-hadoop-file-system 1/5
Hadoop File System
www.KellyTechno.com Page 1
Basic Features: HDFS
ó Highly fault-tolerant
ó High throughput
ó Suitable for applications with large data sets
ó Streaming access to file system data
ó Can be built out of commodity hardware
Fault tolerance:
ó Failure is the norm rather than exception
ó A HDFS instance may consist of thousands of server machines, each storing
part of the file system’s data.
ó Since we have huge number of components and that each component has
non-trivial probability of failure means that there is always some
component that is non-functional.
ó Detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
Data Characteristics:
ó Streaming data access
ó Applications need streaming access to data
ó
Batch processing rather than interactive user access.
ó Large data sets and files: gigabytes to terabytes size
ó High aggregate data bandwidth
ó Scale to hundreds of nodes in a cluster
7/17/2019 Hadoop Training in Hyderabad | Hadoop File System
http://slidepdf.com/reader/full/hadoop-training-in-hyderabad-hadoop-file-system 2/5
Hadoop File System
www.KellyTechno.com Page 2
ó Tens of millions of files in a single instance
ó Write-once-read-many: a file once created, written and closed need not be
changed – this assumption simplifies coherency
ó A map-reduce application or web-crawler application fits perfectly with this
model.
MapReduce:
Namenode and Datanodes:
ó Master/slave architecture
ó HDFS cluster consists of a single Namenode, a master server that manages
the file system namespace and regulates access to files by clients.
ó There are a number of DataNodes usually one per node in a cluster.
ó The DataNodes manage storage attached to the nodes that they run on.
ó HDFS exposes a file system namespace and allows user data to be stored in
files.
7/17/2019 Hadoop Training in Hyderabad | Hadoop File System
http://slidepdf.com/reader/full/hadoop-training-in-hyderabad-hadoop-file-system 3/5
Hadoop File System
www.KellyTechno.com Page 3
ó A file is split into one or more blocks and set of blocks are stored in
DataNodes.
DataNodes: serves read, write requests, performs block creation, deletion, andreplication upon instruction from Namenode.
HDFS Architecture:
File system Namespace:
ó Hierarchical file system with directories and files
ó Create, remove, move, rename etc.
ó Namenode maintains the file system
ó Any meta information changes to the file system recorded by the
Namenode.
ó An application can specify the number of replicas of the file needed:
replication factor of the file. This information is stored in the Namenode.
7/17/2019 Hadoop Training in Hyderabad | Hadoop File System
http://slidepdf.com/reader/full/hadoop-training-in-hyderabad-hadoop-file-system 4/5
Hadoop File System
www.KellyTechno.com Page 4
Data Replication:
ó HDFS is designed to store very large files across machines in a large cluster.
ó Each file is a sequence of blocks.
ó All blocks in the file except the last are of the same size.
ó Blocks are replicated for fault tolerance.
ó Block size and replicas are configurable per file.
ó The Namenode receives a Heartbeat and a BlockReport from each
DataNode in the cluster.
ó BlockReport contains all the blocks on a Datanode.
Replica Placement:
ó The placement of the replicas is critical to HDFS reliability and
performance.
ó Optimizing replica placement distinguishes HDFS from other distributed
file systems.
ó Rack-aware replica placement:
° Goal: improve reliability, availability and network bandwidth
utilization
° Research topic
ó Many racks, communication between racks are through switches.
ó Network bandwidth between machines on the same rack is greater than
those in different racks.
ó Namenode determines the rack id for each DataNode.
7/17/2019 Hadoop Training in Hyderabad | Hadoop File System
http://slidepdf.com/reader/full/hadoop-training-in-hyderabad-hadoop-file-system 5/5
Hadoop File System
www.KellyTechno.com Page 5
ó Replicas are typically placed on unique racks
° Simple but non-optimal
° Writes are expensive
° Replication factor is 3
° Another research topic?
ó Replicas are placed: one on a node in a local rack, one on a different node
in the local rack and one on a node in a different rack.
ó 1/3 of the replica on a node, 2/3 on a rack and 1/3 distributed evenly across
remaining racks.
Replica Selection:
ó Replica selection for READ operation: HDFS tries to minimize the bandwidth
consumption and latency.
ó If there is a replica on the Reader node then that is preferred.
ó HDFS cluster may span multiple data centers: replica in the local datacenter is preferred over the remote one.