map reduce in hadoop

MAP REDUCE

By Ishan Sharma


What is MapReduce?

A programming model and an associated implementation for processing and generating large data sets with a parallel*, distributed* algorithm on a cluster*.

* A parallel algorithm is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result.

* A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors.

* A computer cluster consists of connected computers that work together so that, in many respects, they can be viewed as a single system. Computer clusters have each node set to perform the same task, controlled and scheduled by software.

What is Map()?

A MapReduce program is composed of a Map() procedure that takes one key/value pair with a type in one data domain and returns a list of key/value pairs in a different domain.

It is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call.
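As a rough illustration of the Map() contract described here, the sketch below takes one (key, value) pair and returns a list of pairs in a different domain. The function name `map_fn` and the choice of word counting are assumptions made for illustration; this is plain Python, not Hadoop's Java API.

```python
from typing import List, Tuple


def map_fn(key: int, value: str) -> List[Tuple[str, int]]:
    """Hypothetical Map(): input pair is (byte offset, line of text),
    output is a list of (word, 1) pairs, i.e. a different domain."""
    return [(word, 1) for word in value.split()]


# One call handles one input pair and yields one list of pairs.
pairs = map_fn(0, "What is your name")
print(pairs)  # [('What', 1), ('is', 1), ('your', 1), ('name', 1)]
```

Because each call depends only on its own input pair, calls for different pairs can run on different machines at the same time.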

What is Reduce()?

A MapReduce program is also composed of a Reduce() procedure. The pairs with the same key from all the lists produced by Map() are grouped together, and Reduce() is applied in parallel to each group, which in turn produces a collection of values in the same domain. The returns of all calls are collected as the desired result list.
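A minimal sketch of the Reduce() side, under the same assumptions as a word-count job: the name `reduce_fn` is hypothetical, and summing is just one possible reduction. It receives one key together with every value emitted for that key.

```python
from typing import Iterable, List, Tuple


def reduce_fn(key: str, values: Iterable[int]) -> List[Tuple[str, int]]:
    """Hypothetical Reduce(): collapses all values that share one key
    into a collection of results in the same domain (here, one total)."""
    return [(key, sum(values))]


# All pairs with the key "your" have been grouped before this call.
result = reduce_fn("your", [1, 1, 1])
print(result)  # [('your', 3)]
```

Calls for different keys do not share state, which is what allows them to run in parallel.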

Working of JobTracker & TaskTracker in the MapReduce engine of Hadoop

[Figure: a NameNode and a JobTracker coordinate eight DataNodes (DN1 to DN8), each running a TaskTracker. The NameNode holds the metadata listing which blocks each DataNode stores (DN1: A, B, C; DN2: D, B, C; ... DN8). A job submitted through JobConf (the user interface) runs map tasks whose outputs (o/p) feed a Reducer.]

JobTracker and TaskTracker

• The primary function of the job tracker is resource management (managing the task trackers), tracking resource availability, and task life cycle management (tracking progress, fault tolerance, etc.).

• The task tracker has a simple function: following the orders of the job tracker and periodically updating the job tracker with its progress status.

• The task tracker is pre-configured with a number of slots indicating the number of tasks it can accept.
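The slot idea can be pictured with a small simulation. This is only a sketch: the class names mirror the roles in the slides, but the first-fit assignment policy and the slot counts are assumptions for illustration, not how Hadoop's scheduler actually chooses a tracker.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaskTracker:
    name: str
    slots: int                       # pre-configured capacity
    running: List[str] = field(default_factory=list)

    def accept(self, task: str) -> bool:
        """Take the task only if a slot is still free."""
        if len(self.running) < self.slots:
            self.running.append(task)
            return True
        return False


class JobTracker:
    def __init__(self, trackers: List[TaskTracker]) -> None:
        self.trackers = trackers

    def assign(self, task: str) -> Optional[str]:
        """Assumed policy: give the task to the first tracker with a free slot."""
        for tracker in self.trackers:
            if tracker.accept(task):
                return tracker.name
        return None                  # no free slot anywhere, task must wait


jt = JobTracker([TaskTracker("DN1", slots=2), TaskTracker("DN2", slots=1)])
placements = [jt.assign(f"map-{i}") for i in range(4)]
print(placements)  # ['DN1', 'DN1', 'DN2', None]
```

The fourth task gets no placement because all three slots in this toy cluster are already occupied.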

Fault Tolerance

▫ The task tracker spawns different JVM processes to ensure that process failures do not bring down the task tracker.

▫ The task tracker keeps sending heartbeat messages to the job tracker to say that it is alive and to keep it updated with the number of empty slots available for running more tasks.

▫ From version 0.21 of Hadoop, the job tracker does some checkpointing of its work in the filesystem.
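The heartbeat mechanism can be sketched as follows. The class name `HeartbeatMonitor`, the 10-second timeout, and the explicit `now` argument are all assumptions chosen to keep the example deterministic; they are not Hadoop's actual classes or defaults.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Toy stand-in for the job tracker's view of its task trackers."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}   # tracker -> time of last heartbeat
        self.free_slots: Dict[str, int] = {}    # tracker -> empty slots it reported

    def heartbeat(self, tracker: str, free_slots: int, now: float) -> None:
        """A heartbeat says 'I am alive' and reports the empty slot count."""
        self.last_seen[tracker] = now
        self.free_slots[tracker] = free_slots

    def dead_trackers(self, now: float) -> List[str]:
        """Trackers silent for longer than the timeout are presumed failed."""
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=10.0)
monitor.heartbeat("DN1", free_slots=2, now=0.0)
monitor.heartbeat("DN2", free_slots=1, now=0.0)
monitor.heartbeat("DN1", free_slots=1, now=8.0)   # DN2 stays silent
print(monitor.dead_trackers(now=12.0))  # ['DN2']
```

In this sketch, tasks that were running on a tracker reported as dead would be candidates for rescheduling elsewhere.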

Basic allowable input file formats

• TextInputFormat

• KeyValueTextInputFormat

• SequenceFileInputFormat

• SequenceFileAsTextInputFormat
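To make the difference between the first two formats concrete, here is a plain-Python imitation of the records they hand to a mapper. With TextInputFormat the key is the byte offset of the line and the value is the line; with KeyValueTextInputFormat the line is split at a separator (a tab by default) into key and value. The function names and sample data are hypothetical, and this does not use Hadoop itself.

```python
from typing import Iterator, Tuple


def text_input_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """TextInputFormat-style records: (byte offset of line, line contents)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)            # the line terminator counts toward the offset


def key_value_records(data: bytes, sep: bytes = b"\t") -> Iterator[Tuple[str, str]]:
    """KeyValueTextInputFormat-style records: split each line at the first separator."""
    for raw in data.splitlines():
        key, _, value = raw.partition(sep)
        yield key.decode("utf-8"), value.decode("utf-8")


sample = b"alpha\tone\nbeta\ttwo\n"
text_records = list(text_input_records(sample))
kv_records = list(key_value_records(sample))
print(text_records)  # [(0, 'alpha\tone'), (10, 'beta\ttwo')]
print(kv_records)    # [('alpha', 'one'), ('beta', 'two')]
```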

Primitive class datatypes    Box class datatypes

int                          IntWritable

float                        FloatWritable

long                         LongWritable

char                         Text

String                       Text

Box classes implement the WritableComparable interface by default.

WORDCOUNT JOB

Input file (200 MB), divided into four input splits:

Split 1 (64 MB):
What is your name
Where do you live

Split 2 (64 MB):
I am Ishan
I live in Delhi

Split 3 (64 MB):
Name of your college
I study in MAIT

Split 4 (8 MB):
What are your hobbies

Each input split is read by its own RecordReader, which hands (ByteOffset, EntireLine) pairs to its own Mapper:

InputSplit -> RecordReader -> (ByteOffset, EntireLine) -> Mapper

RecordReader output for the first split:

(0, What is your name)
(19, Where do you live)

Mapper output (INTERMEDIATE DATA):

What,1 Is,1 Your,1 Name,1

Where,1 Do,1 You,1 Live,1

I,1 Am,1 Ishan,1

I,1 Live,1 In,1 Delhi,1

Name,1 Of,1 Your,1 College,1

I,1 Study,1 In,1 MAIT,1

What,1 Are,1 Your,1 Hobbies,1
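The 200 MB input file in this example is shown divided into pieces of 64 MB, 64 MB, 64 MB and 8 MB. A minimal sketch of that arithmetic, assuming fixed-size splits with the remainder as a final smaller split (the helper name `split_sizes` is hypothetical):

```python
from typing import List


def split_sizes(file_mb: int, split_mb: int = 64) -> List[int]:
    """Cut a file into full-size splits plus one smaller split for the remainder."""
    full, rest = divmod(file_mb, split_mb)
    return [split_mb] * full + ([rest] if rest else [])


sizes = split_sizes(200)
print(sizes)       # [64, 64, 64, 8]
print(len(sizes))  # 4, so four RecordReader/Mapper pairs in the slide's picture
```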


SHUFFLING

What,2 . . . Is,1 . . . Your,3 . . . Name,2 . . .

SORTING

Am,1 Are,1 . . . Your,3

Reducer -> RecordWriter -> OutputFile (PART-0000)
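The whole word-count flow (map, shuffle, sort, reduce) can be imitated in a few lines of ordinary Python. This is a single-process sketch of the data flow only, not Hadoop code: the function names are hypothetical, words are lowercased as an assumption so that "Name" and "name" count together, and the line index stands in for the byte offset.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

lines = [
    "What is your name", "Where do you live",
    "I am Ishan", "I live in Delhi",
    "Name of your college", "I study in MAIT",
    "What are your hobbies",
]


def map_fn(key: int, line: str) -> List[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: gather all values that share a key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_fn(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts for one word."""
    return word, sum(counts)


intermediate = [pair for i, line in enumerate(lines) for pair in map_fn(i, line)]
grouped = shuffle(intermediate)
output = [reduce_fn(word, grouped[word]) for word in sorted(grouped)]  # sort by key

for word, count in output:
    print(f"{word},{count}")
```

With these seven lines the totals agree with the slide: what,2, is,1, your,3 and name,2.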


Fields where MapReduce can be implemented

• Distributed pattern-based searching

• Distributed sorting

• Web link-graph reversal

• Web access log stats

• Document clustering

• Statistical machine translation
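As one illustration from this list, web link-graph reversal fits the model naturally: Map emits a (target, source) pair for every link, and Reduce collects all sources that point at a target. The sketch below is plain Python with made-up page names, run in a single process to show the data flow only.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Hypothetical input: each page mapped to the pages it links to.
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}


def map_fn(source: str, targets: List[str]) -> Iterator[Tuple[str, str]]:
    """Map: for every outgoing link, emit (target, source)."""
    for target in targets:
        yield target, source


def reduce_fn(target: str, sources: List[str]) -> Tuple[str, List[str]]:
    """Reduce: list every page that links to the target."""
    return target, sorted(sources)


groups: Dict[str, List[str]] = defaultdict(list)
for source, targets in pages.items():
    for target, src in map_fn(source, targets):
        groups[target].append(src)        # stands in for the shuffle step

reversed_graph = dict(reduce_fn(t, s) for t, s in groups.items())
print(reversed_graph)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```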

Limitations of MapReduce

• It is not always easy to express every computation as a MapReduce program.

• It is a poor fit when your intermediate processes need to talk to each other.

• It is a poor fit when your processing requires a lot of data to be shuffled over the network.

• The fundamentals of Hadoop were not designed to facilitate highly interactive analytics.

• The answer you get from a Hadoop cluster may or may not be 100% accurate, depending on the nature of the job.

• It keeps multiple copies of already big data.

END OF PRESENTATION

THANKS FOR WATCHING…
