A Study of Hadoop in Map-Reduce
Poumita Das, Shubharthi Dasgupta, Priyanka Das
What is Big Data?
Big data is an evolving term that describes any voluminous amount of
structured, semi-structured and unstructured data that has the
potential to be mined for information.
The 3 V’s: Volume, Velocity, and Variety
Why a Distributed File System (DFS)?
An introduction to Map-Reduce
Map-Reduce programs are designed to process large volumes of data in parallel. There are three steps:
• Map
• Shuffle
• Reduce
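The three steps above can be sketched in plain Python with a word count, the canonical Map-Reduce example. This is a toy simulation of the pattern, not Hadoop API code:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cat sat", "the cat ran"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

In real Hadoop, the shuffle is performed by the framework between the user-supplied map and reduce functions; only map and reduce are written by the programmer.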
Map-Reduce continued: Map → Shuffle → Reduce
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Hadoop core components
• NameNode
• DataNode
• Client
• User
• JobTracker
• TaskTracker
Namenode
The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. A cluster may contain hundreds or even thousands of DataNodes.
The Secondary NameNode reads the metadata from RAM and writes it to secondary storage. However, it is NOT a substitute for the NameNode.
Datanode
On startup, a DataNode connects to the NameNode, spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has
provided the location of the data.
HDFS client
User applications access the filesystem using the HDFS client. A client mainly performs three operations:
• Creating a new file
• File read
• File write
Creating a new file
File read
HDFS implements a single-writer, multiple-reader model. That is, many clients can read a file in parallel, but only one writer can modify it at a time.
File write
An HDFS file consists of blocks. When a new block is needed, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block.
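The allocation described above can be illustrated with a toy model. This is a hypothetical sketch, not the real HDFS placement policy (which is rack-aware and considers free space):

```python
import itertools

class ToyNameNode:
    """Toy model of NameNode block allocation (not the real HDFS API)."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self._ids = itertools.count()      # source of unique block IDs
        self.block_map = {}                # block ID -> DataNodes holding replicas

    def allocate_block(self):
        """Assign a fresh block ID and pick DataNodes to host its replicas."""
        block_id = next(self._ids)
        # Simple round-robin placement; real HDFS uses a rack-aware policy.
        start = block_id % len(self.datanodes)
        replicas = [self.datanodes[(start + i) % len(self.datanodes)]
                    for i in range(self.replication)]
        self.block_map[block_id] = replicas
        return block_id, replicas

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
bid, where = nn.allocate_block()
# bid == 0, where == ["dn1", "dn2", "dn3"]
```

The key idea the sketch preserves is that the client never chooses where a block lives: it asks the NameNode, which returns the block ID and the replica locations.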
Job tracker and task tracker
The JobTracker schedules Map-Reduce jobs and assigns tasks across the cluster; each TaskTracker runs map and reduce tasks on its node and reports progress back to the JobTracker.
Hadoop ecosystem
• Pig
• Hive
• Mahout
A Sample Program
The Output
Why Anagrams?
• Started out as a simple relaxation game: finding anagrams in sentences
• Games and puzzles like Scrabble
• Ciphers, such as permutation and transposition ciphers
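Anagram finding maps naturally onto the Map-Reduce pattern: the map step keys each word by its sorted letters, the shuffle groups words sharing a key, and the reduce step keeps groups with more than one word. A minimal Python sketch of the idea (a hypothetical illustration, not the slide's original program):

```python
from collections import defaultdict

def map_word(word):
    """Map: the sorted letters of a word serve as its anagram key."""
    return ("".join(sorted(word.lower())), word)

def find_anagrams(words):
    groups = defaultdict(list)                 # shuffle: group words by key
    for key, word in map(map_word, words):
        groups[key].append(word)
    # Reduce: keep only keys shared by more than one word.
    return [group for group in groups.values() if len(group) > 1]

print(find_anagrams(["listen", "silent", "enlist", "google", "banana"]))
# -> [['listen', 'silent', 'enlist']]
```

Because the sorted-letter key is computed independently per word, the map step parallelizes trivially across a large word list, which is what makes this a good fit for Hadoop.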
Future scope
Keeping in mind the vast range of applications of Hadoop, we have certain graph-searching techniques in mind that would be much easier to solve with the help of the Map-Reduce engine.
References
• Introduction to Hadoop, Welcome to Apache: https://hadoop.apache.org/
• Cloudera Documentation, Usage: http://www.cloudera.com/content/cloudera/en/documentation/hadoop-tutorial/CDH5/Hadoop-Tutorial/ht_usage.html
• Edureka, Anatomy of a Map-Reduce Job: http://www.edureka.co/blog/anatomy-of-a-mapreduce-job-in-apache-hadoop/
• Stackoverflow, Explain Map-Reduce Simply: http://stackoverflow.com/questions/28982/please-explain-mapreduce-simply
Thank you