hadoop – big deal


Post on 16-Apr-2017








Hadoop Big Deal !!
Author: Abhishek Kumar +1-323-806-5474

Contents
What is Hadoop
Hadoop Components
Why Hadoop
HDFS
HDFS Features
When Not to Use Hadoop
HDFS Components
DFS and HDFS
Hadoop & Big Data Relatives !

What is Hadoop

Conventional Definition: A framework for distributed processing of large datasets (usually unstructured data) across clusters of commodity hardware.

Well, I have always been bad with these bookish definitions and never understood the heavy terms used in them. So, here are some explanations:

Distributed Processing: Spreading a heavy task across several workers/resources to reduce the time taken to complete it.
Large Datasets (Unstructured Data): Data that does not have any defined structure, format or size.
Commodity Hardware: Inexpensive, easily available hardware, usually of modest performance. These machines can fail at any time.

So, for now we can say that Hadoop is a system that stores huge volumes of unstructured data in such a way that the data can be read faster.

Fun Fact: Hadoop follows many of the standards, directory structures and other patterns of LINUX/UNIX. Most details are easily available on the Apache web site.

Hadoop Components

Level 0 - Hadoop
Hadoop Distributed File System (HDFS)
Simple Programming Model (MapReduce)

HDFS: HDFS is just a file system that handles the storage of data, the Hadoop way.
MapReduce: Though written as one word, Map and Reduce are two separate programs. Map processes the data where it is spread across the distributed environment and emits intermediate results; Reduce combines those results, which cuts down the complexity/volume of data that is sent, received or processed.
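To make the Map and Reduce split concrete, here is a minimal single-machine sketch of the classic word count. It is only an illustration of the idea, not Hadoop's actual API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this example, and each "chunk" stands in for a piece of data that would live on a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    pairs = []
    for chunk in chunks:  # in a cluster, each chunk would be mapped on its own node
        pairs.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big deal", "hadoop handles big data"]))
```

Note how the reduce step only ever sees small (word, count) pairs, not the original text. That is the "reduce the volume of data" part of the name.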

Why Hadoop !

So if Hadoop is just another storage system, then why so much hype?

Yes, Hadoop is yet another distributed file processing system, but one thing makes it different, or in fact special: faster I/O processing using commodity hardware.

We all know that this generation has no issue with storage size. We have TBs of hard drives available at home too. The problem that remains is accessing huge volumes of unstructured data using the low-performance I/O devices we have. This is where Hadoop comes to the rescue. How? We will see that in the other slides.

Fun Fact: Hadoop is not a single piece of software that you simply download and run on your system. It is a set of tools organized to serve a specific purpose.

HDFS

Conventional Definition: HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

As the name says, it is a distributed file system that follows some specific protocols, standards and techniques, which we will call the Hadoop way.

HDFS Advantages

Fault Tolerance: An important highlight in the definition of Hadoop is the use of commodity hardware, so we can be certain that nodes will fail. Hadoop handles these failing nodes very effectively and ensures that we do not lose any data. How? Read about replication: HDFS keeps copies of every block of data on several nodes.
Handles Large Datasets: No wonder companies like Facebook and Yahoo prefer it, and its design follows ideas published by Google. It is a proven system for handling large data sets.
Streaming Access to File System Data: Data is typically written once and then read from start to finish, much like streaming a video instead of jumping around in it.
High Performance: Ideally, processing data with Hadoop is close to n times faster, where n is the number of data nodes; in practice coordination overhead keeps the speedup below n.
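The fault tolerance point is easier to see with a toy model of replication. The sketch below is an assumption-laden simplification: it spreads copies round-robin over the nodes, whereas real HDFS uses a rack-aware placement policy. The helper names `place_replicas` and `readable_blocks` are hypothetical; the default of 3 copies matches HDFS's default replication factor.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_replicas(["b0", "b1", "b2", "b3"], nodes)
    # With 3 copies of each block, any 2 node failures leave every block readable.
    print(sorted(readable_blocks(placement, ["n1", "n2"])))
```

With three copies per block, data is lost only when all three nodes holding a given block fail before the system has had a chance to re-create the missing copies.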

When Not to Use Hadoop/HDFS

For many small files used in transactions.

When low-latency data access is needed.

When many people modify the data/files arbitrarily (multiple writers).

HDFS Components

Name Node: This component of HDFS generally runs on a high-performance machine. In layman's terms, it is a kind of index for the data spread across several data nodes. We can also call it the metadata storage process.
Data Node: This is responsible for storing the actual data. It runs as a daemon on each local machine.

Fun Fact: A daemon is a resident program that runs in the background on your machine as a process. Daemon is terminology used in UNIX. In DOS we call it a TSR.
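The "Name Node is the index, Data Nodes hold the data" split can be sketched in a few lines. Everything here is a toy under stated assumptions: `ToyNameNode`, `ToyDataNode`, `write_file` and `read_file` are hypothetical names, the 4-byte block size is chosen only to keep the example readable (real HDFS blocks are far larger), and replication is left out.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self.index = {}  # path -> list of (block_id, data_node_name)

    def register(self, path, block_locations):
        self.index[path] = list(block_locations)

    def locate(self, path):
        return self.index[path]


class ToyDataNode:
    """Holds the actual bytes of the blocks assigned to it."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def write_file(name_node, data_nodes, path, data, block_size=4):
    """Split data into blocks, spread them over data nodes, record the index."""
    locations = []
    for offset in range(0, len(data), block_size):
        number = offset // block_size
        block_id = f"{path}#{number}"
        node = data_nodes[number % len(data_nodes)]
        node.store(block_id, data[offset:offset + block_size])
        locations.append((block_id, node.name))
    name_node.register(path, locations)


def read_file(name_node, data_nodes, path):
    """Ask the name node where the blocks are, then fetch them from data nodes."""
    by_name = {node.name: node for node in data_nodes}
    return b"".join(
        by_name[node_name].read(block_id)
        for block_id, node_name in name_node.locate(path)
    )
```

The name node never touches file contents in this model; it only answers "where are the blocks?", which is why the slide describes it as an index.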

DFS and HDFS

So, what is the difference between a regular Distributed File System and Hadoop?

Hadoop processes the data on the local nodes and transmits just the output to the client, while in a regular DFS the data is brought from the various nodes to a master node for processing. So, quite obviously, Hadoop has to transfer less data (just the output) over the network, while a regular DFS has to transfer a huge volume of data. This makes Hadoop the winner for faster processing!

This way of processing data on the data nodes that hold it is called data locality, which is one of the important super powers of Hadoop.
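A rough way to see the difference is to count the bytes that cross the network in each approach. The sketch below is a toy comparison, not a benchmark: the log lines, the node names and the functions `central_processing` and `local_processing` are all invented for illustration, and the byte counts ignore protocol overhead.

```python
def count_errors(lines):
    """The actual work: count log lines that contain the word ERROR."""
    return sum(1 for line in lines if "ERROR" in line)


def central_processing(node_data):
    """Ship every line to one machine first, then process it there."""
    shipped = [line for lines in node_data.values() for line in lines]
    bytes_moved = sum(len(line.encode()) for line in shipped)
    return count_errors(shipped), bytes_moved


def local_processing(node_data):
    """Each node counts its own lines and ships only its small partial result."""
    partial_counts = [count_errors(lines) for lines in node_data.values()]
    bytes_moved = sum(len(str(count).encode()) for count in partial_counts)
    return sum(partial_counts), bytes_moved


if __name__ == "__main__":
    node_data = {
        "dn1": ["ERROR disk", "INFO ok"],
        "dn2": ["INFO ok", "ERROR net", "ERROR cpu"],
    }
    print(central_processing(node_data))  # same answer, more bytes moved
    print(local_processing(node_data))    # same answer, fewer bytes moved
```

Both approaches give the same answer; only the amount of data moved differs, and that gap grows with the size of the input.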

Hadoop & Big Data Relatives !

The relation is not very complex. It's just like a simple husband-wife relation, where Hadoop comes in just to resolve issues with Big Data.

In other words, Big data provides challenges for Hadoop to resolve.

Thanks !

I will probably provide more details in the next presentation.