MapReduce in Hadoop
MAP REDUCE
By Ishan Sharma
Animation in presentation can be viewed by downloading it…
WHAT IS MapReduce?
A programming model and an associated
implementation for processing and generating large
data sets with a parallel*, distributed* algorithm on a
cluster*.
A parallel algorithm is an algorithm which can be executed a piece at a time
on many different processing devices, with the partial results combined at
the end to get the correct result.
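A minimal sketch of that idea, assuming threads stand in for the "processing devices" and that the work is a simple sum. The name parallel_sum and the default of four pieces are illustrative, not part of the presentation.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(numbers, pieces=4):
    """Sum a list by splitting it into pieces, summing each piece, then combining."""
    # Ceiling division so that every element lands in exactly one chunk.
    size = max(1, -(-len(numbers) // pieces))
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    # Each chunk is summed by a worker; threads stand in for separate devices here.
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # The combine step at the end produces the final result.
    return sum(partial_sums)
```

The same split, process, combine shape is what MapReduce applies across the machines of a cluster.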
A distributed algorithm is an algorithm designed to run on computer hardware
constructed from interconnected processors.
A computer cluster consists of connected computers that work together so
that, in many respects, they can be viewed as a single system. Computer
clusters have each node set to perform the same task, controlled and
scheduled by software.
What is Map()?
A MapReduce program is composed of
a Map() procedure that takes one (key, value) pair with types
in one data domain, and returns a list of (key, value) pairs in a
different domain.
It is applied in parallel to every pair in the input dataset.
This produces a list of pairs for each call.
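A minimal sketch of a Map() procedure, assuming the input pairs are (byte offset, line of text) and the output pairs are (word, 1). The names map_word_count and run_map are illustrative, and the loop here is sequential; a real framework would run the calls in parallel.

```python
def map_word_count(offset, line):
    """Map one (offset, line) pair to a list of (word, 1) pairs."""
    # The offset key is ignored; only the line's content matters for counting words.
    return [(word, 1) for word in line.split()]


def run_map(records, map_fn):
    """Apply map_fn to every input pair; each call yields its own list of pairs."""
    return [map_fn(key, value) for key, value in records]
```

Note how the domain changes: the input key is a number and the value a whole line, while the output key is a word and the value a count.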
What is Reduce()?
A MapReduce program is composed of a
Reduce() procedure
that is applied in parallel to each group of pairs sharing the same key,
gathered from all the lists produced by Map(). Each call produces a
collection of values in the same domain. The returns of all calls are
collected as the desired result list.
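A minimal sketch of the Reduce() side, assuming the Map() output is a list of lists of (key, value) pairs and that the reduction is a sum. The names group_by_key, reduce_sum and run_reduce are illustrative only.

```python
from collections import defaultdict


def group_by_key(map_outputs):
    """Collect the values of all pairs that share a key, across every map output list."""
    groups = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_sum(key, values):
    """Reduce one key's values to a collection holding their sum."""
    return [sum(values)]


def run_reduce(map_outputs, reduce_fn):
    """Call reduce_fn once per key; the calls are independent, so they could run in parallel."""
    groups = group_by_key(map_outputs)
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

Because each key is reduced independently of the others, different keys can be handled by different machines.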
Working of JobTracker & TaskTracker in
MapReduce engine of Hadoop
[Diagram: a NameNode holding the META DATA (DN1: A,B,C  DN2: D,B,C  :  :  DN8:) and a JobTracker, with eight DataNodes DN1 to DN8, each running a TaskTracker. The job is submitted through JOBCONF (User Interface); two map tasks send their o/p to a Reducer.]
JobTracker and
TaskTracker
• The primary functions of the job tracker are resource
management (managing the task trackers), tracking
resource availability, and task life cycle management
(tracking progress, fault tolerance, etc.).
• The task tracker has a simple function: following the
orders of the job tracker and periodically updating the job
tracker with its progress status.
• The task tracker is pre-configured with a number of slots
indicating the number of tasks it can accept.
Fault Tolerance
▫ The task tracker spawns separate JVM processes for its tasks, so that a failing task process does not bring down the task tracker.
▫ The task tracker keeps sending heartbeat messages to the job tracker to say that it is alive and to keep it updated with the number of empty slots available for running more tasks.
▫ From version 0.21 of Hadoop, the job tracker does some checkpointing of its work in the filesystem.
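The heartbeat and slot bookkeeping can be pictured with a toy model. This is only a sketch: the class, its method names and the timeout value are hypothetical and are not Hadoop's actual API. Time is passed in explicitly to keep the example deterministic.

```python
class JobTracker:
    """Toy job tracker: remembers each task tracker's last heartbeat and free slots."""

    def __init__(self, timeout):
        self.timeout = timeout   # how long a tracker may stay silent (hypothetical units)
        self.last_seen = {}      # tracker id -> time of last heartbeat
        self.free_slots = {}     # tracker id -> empty slots reported in that heartbeat

    def heartbeat(self, tracker_id, free_slots, now):
        """Record that a task tracker is alive and how many slots it has free."""
        self.last_seen[tracker_id] = now
        self.free_slots[tracker_id] = free_slots

    def live_trackers(self, now):
        """Trackers whose last heartbeat is recent enough to count as alive."""
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen <= self.timeout)

    def assign(self, now):
        """Give one task to a live tracker with a free slot; return None if there is none."""
        for tracker_id in self.live_trackers(now):
            if self.free_slots[tracker_id] > 0:
                self.free_slots[tracker_id] -= 1
                return tracker_id
        return None
```

For example, a tracker that reported two free slots at time 0 is no longer offered work at time 12 if the timeout is 10, because its heartbeat has gone stale.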
Basic Allowable Input File Formats
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat
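As an illustration of how the formats differ, KeyValueTextInputFormat turns each line into a key and a value by cutting it at the first separator, which is a tab by default. A minimal Python sketch of that rule follows; the function name key_value_record is illustrative, and the real class is Java.

```python
def key_value_record(line, separator="\t"):
    """Split a line into (key, value) at the first separator.

    If the separator does not occur, the whole line becomes the key
    and the value is empty.
    """
    key, _, value = line.partition(separator)
    return key, value
```

TextInputFormat, by contrast, uses the line's byte offset as the key and the entire line as the value.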
Primitive class datatypes    Box class datatypes
int                          IntWritable
float                        FloatWritable
long                         LongWritable
char                         Text
String                       Text
Box classes implement the WritableComparable interface by default.
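What "writable" and "comparable" mean for a box class can be sketched in a few lines. This is a Python stand-in, not Hadoop code: the class name IntBox and its method names are hypothetical. The four-byte big-endian encoding follows what Java's DataOutput.writeInt produces, which is what IntWritable relies on.

```python
import struct
from functools import total_ordering


@total_ordering
class IntBox:
    """Toy counterpart of a box class: serializable (writable) and orderable (comparable)."""

    def __init__(self, value=0):
        self.value = value

    def write(self):
        """Serialize the value as 4 big-endian bytes."""
        return struct.pack(">i", self.value)

    @classmethod
    def read(cls, data):
        """Rebuild a box from its 4-byte serialized form."""
        (value,) = struct.unpack(">i", data)
        return cls(value)

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value
```

Being writable lets keys and values travel between machines, and being comparable lets the framework sort keys before they reach the reducer.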
Input file 200MB, divided into four inputSplits (64MB, 64MB, 64MB, 8MB).
Each inputSplit is read by its own RecordReader, which hands (ByteOffset, EntireLine) pairs to its own Mapper.

inputSplit 1 (64MB):
What is your name
Where do you live
inputSplit 2 (64MB):
I am Ishan
I live in Delhi
inputSplit 3 (64MB):
Name of your college
I study in MAIT
inputSplit 4 (8MB):
What are your hobbies
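The split sizes follow from simple arithmetic: a 200MB file cut into 64MB pieces gives three full splits and one 8MB remainder. A minimal sketch, assuming a fixed split size of 64MB; the function name split_sizes is illustrative, and Hadoop's real split computation takes more settings into account than this.

```python
def split_sizes(file_mb, split_mb=64):
    """Return the sizes of the splits a file is cut into, in MB."""
    full, rest = divmod(file_mb, split_mb)
    # All full-size splits first, then the remainder if there is one.
    return [split_mb] * full + ([rest] if rest else [])
```

Each resulting split gets its own Mapper, so the 200MB file here is processed by four mappers.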
RecordReader output for the first inputSplit:
(0, What is your name)
(19, Where do you live)
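A sketch of how a RecordReader for plain text could produce these pairs. The function name read_records is illustrative. One assumption is needed to reproduce the offset 19: each line must end with a two-byte terminator (carriage return plus newline); with a single newline byte the second line would start at offset 18.

```python
def read_records(data):
    """Turn raw bytes into (byte offset, line) pairs, one per line."""
    records = []
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The value is the line without its terminator; the key is where the line starts.
        line = raw.rstrip(b"\r\n").decode("utf-8")
        records.append((offset, line))
        # Advance by the full raw length, terminator included.
        offset += len(raw)
    return records
```

The key is a position in the file rather than a line number, which is why it jumps by the length of the previous line.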
INTERMEDIATE DATA (Mapper output):
What,1 Is,1 Your,1 Name,1
Where,1 Do,1 You,1 Live,1
I,1 Am,1 Ishan,1
I,1 Live,1 In,1 Delhi,1
Name,1 Of,1 Your,1 College,1
I,1 Study,1 In,1 MAIT,1
What,1 Are,1 Your,1 Hobbies,1
WORDCOUNT JOB (animation in slide)
[Animation frames repeat the INTERMEDIATE DATA pairs as they move through the job.]
SHUFFLING: What,2 . . . . . Is,1 . . . . . Your,3 . . . . . Name,2 . . . . .
SORTING: Am,1 Are,1 .. Your,3
Reducer
RecordWriter
OutputFile (PART-0000)
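The shuffle, sort and reduce stages of the word count job can be replayed on the slide's own intermediate data. This is a single-process sketch, not Hadoop code: the function names are illustrative, and the tab-separated output lines are an assumption about how the part file is laid out.

```python
from itertools import chain, groupby
from operator import itemgetter

# Intermediate (word, 1) pairs, one list per mapper, as shown in the slides.
MAPPER_OUTPUTS = [
    [("What", 1), ("Is", 1), ("Your", 1), ("Name", 1),
     ("Where", 1), ("Do", 1), ("You", 1), ("Live", 1)],
    [("I", 1), ("Am", 1), ("Ishan", 1),
     ("I", 1), ("Live", 1), ("In", 1), ("Delhi", 1)],
    [("Name", 1), ("Of", 1), ("Your", 1), ("College", 1),
     ("I", 1), ("Study", 1), ("In", 1), ("MAIT", 1)],
    [("What", 1), ("Are", 1), ("Your", 1), ("Hobbies", 1)],
]


def shuffle_and_sort(mapper_outputs):
    """Bring the pairs from every mapper together and order them by key."""
    return sorted(chain.from_iterable(mapper_outputs), key=itemgetter(0))


def reduce_counts(sorted_pairs):
    """Sum the values of each run of equal keys in the sorted pairs."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(sorted_pairs, key=itemgetter(0))]


def format_part_file(results):
    """Render results as 'key<TAB>value' lines, one per key."""
    return "".join(f"{key}\t{count}\n" for key, count in results)
```

Running reduce_counts(shuffle_and_sort(MAPPER_OUTPUTS)) gives a list that starts with Am,1 and Are,1 and ends with Your,3, with What,2 and Name,2 in between, matching the counts on the slide.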
Fields where MapReduce can be
applied
Distributed pattern-based searching
Distributed sorting
Web link-graph reversal
Web access log stats
Document clustering
Statistical machine translation.
Limitations of MapReduce
• It is not always easy to express every problem as a MapReduce program.
• It is a poor fit when your intermediate processes need to talk to each other.
• It is a poor fit when your processing requires a lot of data to be shuffled over the network.
• The fundamentals of Hadoop were not designed to facilitate highly interactive analytics.
• The answer you get from a Hadoop cluster may or may not be 100% accurate, depending on the nature of the job.
• It keeps multiple copies of already big data.
END OF PRESENTATION
THANKS FOR WATCHING…