Apache Hadoop HDFS

What is it?

What is it for?

Architecture

Resilience

Administration

Data access

Future changes?

HDFS What is it?

HDFS = Hadoop Distributed File System

It is a distributed file system

Runs on low cost hardware

It is open source

Written in Java

Fault tolerant

Designed for very large data sets

Tuned for high throughput

HDFS What is it for?

Designed for batch processing

Streaming access to data

Large data sizes, i.e. terabytes

Highly reliable using data replication

Supports very large node clusters

Supports large files

Supports file counts in the millions

HDFS Architecture

Has a master / slave architecture

A master NameNode

Controls file system operations

Maps data blocks to DataNodes

Logs all changes

Slave DataNodes

Store file blocks

Store replicated data
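The NameNode's block-to-DataNode mapping can be illustrated with a toy model. This is a minimal sketch, not Hadoop code: the class ToyNameNode, the round-robin placement, the node names and the 128 MB block size are all assumptions made for illustration. Real HDFS placement also takes racks and node load into account.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical class, not part of the Hadoop API.
class ToyNameNode {
    private final List<String> dataNodes;
    private final int replication;
    // block id -> DataNodes holding a replica of that block
    private final Map<String, List<String>> blockMap = new LinkedHashMap<>();
    private int next = 0; // round-robin cursor (assumed placement policy)

    ToyNameNode(List<String> dataNodes, int replication) {
        if (replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException(
                "replication must be between 1 and the number of DataNodes");
        }
        this.dataNodes = dataNodes;
        this.replication = replication;
    }

    // Split a file into fixed-size blocks and record where each replica goes.
    List<String> addFile(String name, long fileSize, long blockSize) {
        long blocks = (fileSize + blockSize - 1) / blockSize; // ceiling division
        List<String> ids = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            String id = name + "#" + b;
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // consecutive nodes, so replicas of one block are on distinct nodes
                replicas.add(dataNodes.get((next + r) % dataNodes.size()));
            }
            next = (next + 1) % dataNodes.size();
            blockMap.put(id, replicas);
            ids.add(id);
        }
        return ids;
    }

    // Look up which DataNodes hold a given block.
    List<String> locate(String blockId) {
        return blockMap.get(blockId);
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode(List.of("dn1", "dn2", "dn3", "dn4"), 3);
        long mb = 1024L * 1024L;
        for (String id : nn.addFile("logs.txt", 300 * mb, 128 * mb)) {
            System.out.println(id + " -> " + nn.locate(id));
        }
    }
}
```

A 300 MB file with 128 MB blocks yields three blocks, each recorded against three different DataNodes.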

HDFS Resilience

Data is replicated across DataNodes

Nodes may fail but data is still available

DataNodes report their state via heartbeat messages

Single point of failure in master NameNode

Data integrity via checksums
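The checksum idea can be sketched with the JDK's own CRC32 class. This is an illustration only, not HDFS code: the class ChunkChecksums, its method names and the 512 byte chunk size are assumptions. The principle is that a checksum is stored for each chunk at write time and recomputed at read time, so a damaged replica can be detected and another replica read instead.

```java
import java.util.zip.CRC32;

// Illustration only: hypothetical helper, not part of the Hadoop API.
class ChunkChecksums {
    // One CRC32 value per fixed-size chunk of the data.
    static long[] checksums(byte[] data, int chunkSize) {
        int chunks = (data.length + chunkSize - 1) / chunkSize; // ceiling division
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int off = i * chunkSize;
            int len = Math.min(chunkSize, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Index of the first chunk whose checksum no longer matches, or -1 if intact.
    static int firstCorruptChunk(byte[] data, long[] expected, int chunkSize) {
        long[] actual = checksums(data, chunkSize);
        int n = Math.min(actual.length, expected.length);
        for (int i = 0; i < n; i++) {
            if (actual[i] != expected[i]) {
                return i;
            }
        }
        // A length mismatch means data was truncated or extended.
        return actual.length == expected.length ? -1 : n;
    }

    public static void main(String[] args) {
        byte[] block = new byte[2000];
        for (int i = 0; i < block.length; i++) {
            block[i] = (byte) i;
        }
        long[] stored = checksums(block, 512);   // computed when the block is written

        byte[] replica = block.clone();
        replica[1300] ^= 0x01;                   // simulate a flipped bit on disk

        System.out.println(firstCorruptChunk(block, stored, 512));
        System.out.println(firstCorruptChunk(replica, stored, 512));
    }
}
```

The intact copy reports -1, while the damaged replica reports chunk 2, the chunk that contains byte 1300.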

HDFS Administration

Access via Java API

FS Shell command language

HTTP browser

C wrapper for Java API

Space reclamation

Via control of replication factor

Deleted files sent to trash folder

Trash folder cleaned after configurable time
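How the replication factor affects space can be shown with simple arithmetic. This is a rough sketch under stated assumptions: the class ReplicationSpace and its methods are hypothetical, and the model ignores metadata and partially filled blocks. Raw storage is taken as file size multiplied by replication factor.

```java
// Illustration only: hypothetical helper, not part of the Hadoop API.
class ReplicationSpace {
    // Raw bytes used across the cluster for a file of the given logical size.
    static long rawBytes(long logicalBytes, int replication) {
        return Math.multiplyExact(logicalBytes, (long) replication);
    }

    // Raw bytes freed when the replication factor is lowered from 'from' to 'to'.
    static long reclaimed(long logicalBytes, int from, int to) {
        return rawBytes(logicalBytes, from) - rawBytes(logicalBytes, to);
    }

    public static void main(String[] args) {
        long oneTb = 1024L * 1024L * 1024L * 1024L;
        System.out.println("Raw bytes at replication 3: " + rawBytes(oneTb, 3));
        System.out.println("Bytes freed going 3 -> 2:   " + reclaimed(oneTb, 3, 2));
    }
}
```

Under this model, lowering the replication factor of a 1 TB file from 3 to 2 frees 1 TB of raw storage, at the cost of one fewer copy to fall back on.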

HDFS Future changes

Features that might be considered for HDFS

File append

User quotas

File links

Standby nodes

Other Areas

Want to know about?

Big Data

Nutch

Solr

See my other presentations

Contact Us

Feel free to contact us at

www.semtech-solutions.co.nz

[email protected]

We offer IT project consultancy

We are happy to hear about your problems

You pay only for the hours you need to solve your problems
