hadoop – big deal

Hadoop – Big Deal !!

Author : Abhishek Kumar +1-323-806-5474

Contents
• What is Hadoop
• Hadoop Components
• Why Hadoop
• HDFS
• HDFS Features
• When Not to use Hadoop
• HDFS Components
• DFS and HDFS
• Hadoop & Big data – Relatives !

What is Hadoop
• Conventional Definition
• Framework for distributed processing of large datasets (usually unstructured data) across clusters of commodity hardware.

• Well, I have always been bad with these bookish definitions and never understood the heavy terms used in them. So, here are some explanations:
• Distributed Processing : Spreading a heavy task across several workers/resources to reduce the time taken to complete it.
• Large Datasets (Unstructured data) : Data that does not have any defined structure, format or size.
• Commodity Hardware : Inexpensive, easily available hardware, usually with modest performance. These machines can fail at any time.

So, as of now, we can say that Hadoop is a system that stores huge volumes of unstructured data in a way that lets the data be read faster.

• Fun Fact: Hadoop follows many of the standards, directory structures and other patterns of LINUX/UNIX. Most details are easily available on the “Apache” web site.

Hadoop Components
Level 0 - Hadoop
• HDFS : Hadoop Distributed File System
• MapReduce : Simple Programming Model

HDFS : HDFS is just a file system that takes care of storing the data, the Hadoop way.

MapReduce : Though written as a single word, Map and Reduce are 2 separate steps. The Map step processes the pieces of data spread across the distributed environment, and the Reduce step combines the Map outputs, which reduces the volume of data that has to be sent, received or processed.
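To make the two steps concrete, here is a minimal single-machine sketch of the Map and Reduce idea, using the classic word count. It is plain Python and not the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample chunks are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    """Run Map on every chunk, group by word, then Reduce each group."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Each string stands in for a piece of data stored on a different node.
    chunks = ["big data big deal", "hadoop handles big data"]
    print(word_count(chunks))
```

In a real cluster each chunk would be mapped on the node that stores it, and only the small (word, count) pairs would travel over the network.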

Why Hadoop !
• So, if Hadoop is just another storage system, then why so much hype?

• Yes, Hadoop is yet another distributed file processing system, but I see something that makes it different, in fact special: “Faster I/O processing using commodity hardware”.

• We all know that this generation has no issue with storage size; we have TBs of hard drives even at home. The only remaining problem is accessing huge volumes of unstructured data using the low-performance I/O devices we have. This is where Hadoop comes to the rescue. How? We will find out in the other slides.

• Fun Fact: Hadoop is not a single piece of software that you just download and install on your system. It is a set of tools organized to serve a specific purpose.

HDFS
Conventional Definition : HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Like the name says, it is a Distributed File System that follows some specific protocols/standards or techniques, which we will call the Hadoop way.

[Diagram: the Map Reduce Engine (Job Tracker, Task Tracker) shown next to the HDFS Cluster (Name Node and four Data Nodes).]

HDFS Advantages
• Fault Tolerance
• Since “using commodity hardware” is a highlight of Hadoop's definition, we can be certain that nodes will fail. But Hadoop handles failing nodes very effectively and ensures that we do not lose any data. How? Read about replication.

• Handles large datasets
• No wonder companies like Facebook and Yahoo prefer it; its design is also based on Google's published work on distributed file systems and MapReduce. So it is a proven system for handling large datasets.

• Streaming access to file system data
• Files are written once and then read sequentially from start to end, much like the way you watch a video on “youtube”.

• High Performance
• In the ideal case, processing data with Hadoop is up to “n” times faster, where n is the number of data nodes, because every node works on its own share of the data in parallel.
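The replication idea behind fault tolerance can be sketched in a few lines. This is a toy model under stated assumptions: blocks are placed round-robin (real HDFS placement also considers racks), the replication factor of 3 matches the HDFS default, and the names `place_blocks`, `surviving_blocks` and `dn1`..`dn4` are hypothetical.

```python
import itertools


def place_blocks(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        }
    return placement


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]
    placement = place_blocks(6, nodes)
    # With 3 copies of every block, losing any 2 nodes loses no data.
    for failed in itertools.combinations(nodes, 2):
        assert surviving_blocks(placement, failed) == set(placement)
    print("all blocks survive any 2 node failures")
```

With three copies of each block, data is lost only when every node holding a given block fails at the same time, which is why cheap, failure-prone machines are acceptable.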

When Not to Use Hadoop/HDFS

• When there are many small files, as in transaction processing

• When low-latency data access is needed

• When many people modify the data/files arbitrarily (multiple writers)

HDFS Components

Name Node (master; the Job Tracker is the matching master on the MapReduce side)

Data Nodes (workers; the Task Trackers are the matching workers on the MapReduce side)

Name Node : This component of HDFS generally runs on a high-performance machine. In layman's terms, it is a kind of “Index” for the data spread across the data nodes. We can also call it the metadata storage process.

Data Node : This is responsible for storing the actual data. It runs as a daemon on each local machine.
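The “Index” idea can be sketched as follows. This is a toy in-memory model, not how HDFS is implemented: the classes `NameNode` and `DataNode`, the helpers `write_file` and `read_file`, the block-id format and the tiny block size are all assumptions chosen for illustration.

```python
class NameNode:
    """Toy metadata index: knows where blocks live, stores no file data."""

    def __init__(self):
        self.index = {}  # file path -> ordered list of (block_id, node_name)

    def register(self, path, block_id, node_name):
        self.index.setdefault(path, []).append((block_id, node_name))

    def locate(self, path):
        return list(self.index[path])


class DataNode:
    """Toy storage worker: holds the actual bytes of each block."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def write_file(path, data, block_size, name_node, data_nodes):
    """Split data into blocks, spread them over the nodes, record each location."""
    for start in range(0, len(data), block_size):
        number = start // block_size
        block_id = f"{path}#{number}"
        node = data_nodes[number % len(data_nodes)]
        node.store(block_id, data[start:start + block_size])
        name_node.register(path, block_id, node.name)


def read_file(path, name_node, data_nodes):
    """Ask the name node where the blocks are, then fetch them in order."""
    by_name = {node.name: node for node in data_nodes}
    return b"".join(
        by_name[node_name].read(block_id)
        for block_id, node_name in name_node.locate(path)
    )


if __name__ == "__main__":
    name_node = NameNode()
    data_nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    write_file("/logs/a.txt", b"hello hadoop world", 5, name_node, data_nodes)
    print(name_node.locate("/logs/a.txt"))
    print(read_file("/logs/a.txt", name_node, data_nodes))
```

Note that the name node object never touches the file contents; it only answers the question "which node holds which block".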

Fun Fact: A daemon is a resident program that runs in the background on your machine as a process. Daemon is the terminology used in UNIX. In DOS we call it a TSR.

DFS and HDFS
• So, what is the difference between a regular Distributed File System and Hadoop?

• Hadoop processes the data on the local nodes and transmits just the output to the client, while in a regular DFS the data is brought from the various nodes to a master node for processing. So, quite obviously, Hadoop has to transfer far less data (just the output) over the network, while a regular DFS has to transfer a huge volume of data. This makes Hadoop the winner for faster processing!

• This processing of data on the data nodes themselves is called data locality (data localization), which is one of the important super powers of Hadoop.
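A rough way to see the difference is to count the bytes that cross the network in each approach. The sketch below is a toy comparison under stated assumptions: the log records, the node names and the `count_errors` job are hypothetical, and real transfers would also include protocol overhead.

```python
def bytes_moved_central(node_data):
    """Regular DFS style: ship every record to one master, then process there."""
    return sum(len(record) for records in node_data.values() for record in records)


def bytes_moved_local(node_data, process):
    """Hadoop style: process on each node and ship only the per-node result."""
    return sum(len(process(records)) for records in node_data.values())


def count_errors(records):
    """Hypothetical job: count the records containing b'ERROR', return it as bytes."""
    return str(sum(b"ERROR" in record for record in records)).encode()


if __name__ == "__main__":
    node_data = {
        "dn1": [b"INFO start", b"ERROR disk", b"INFO done"],
        "dn2": [b"ERROR net", b"ERROR disk"],
    }
    print("central:", bytes_moved_central(node_data), "bytes")
    print("local:  ", bytes_moved_local(node_data, count_errors), "bytes")
```

The gap grows with the size of the input: the centralized cost scales with the raw data, while the local cost scales only with the size of each node's result.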

Hadoop & Big Data – Relatives !

The relation is not very complex. It's just like a simple husband-wife relation, where Hadoop comes in just to resolve the issues with Big data.

In other words, Big data provides the challenges for Hadoop to resolve.

Thanks !

I will probably provide more details in the next presentation.
