Map Reduce and Hadoop on Windows
DESCRIPTION
Map Reduce: an introduction and its implementation.
TRANSCRIPT
1. Map Reduce
Muhammad Usman Shahid
Software Engineer [email protected]
10/17/2011
1
2. Parallel Programming
Parallel programming is used for performance and efficiency.
Processing is broken up into parts that are done concurrently.
The instructions of each part run on a separate CPU, with many processors connected together.
Identifying the set of tasks that can run concurrently is important.
The Fibonacci recurrence is Fk+2 = Fk+1 + Fk.
Clearly the Fibonacci computation cannot be parallelized, as each value depends on the previously computed ones.
Now consider a huge array, which can be broken up into sub-arrays.
3. Parallel Programming
If each element requires some processing, with no dependencies between the computations, we have an ideal parallel computing opportunity.
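The contrast above can be illustrated with a short sketch in plain Java (class and method names are my own, not from the slides): parallel streams split independent per-element work across CPUs, which is exactly what the Fibonacci recurrence forbids.

```java
import java.util.stream.IntStream;

// Hypothetical sketch: process every element of an array in parallel.
// Each element's computation is independent of the others, so the work
// can be split across CPUs with no coordination between the parts.
public class ParallelSquares {
    static long[] squareAll(int[] data) {
        return IntStream.of(data)
                .parallel()                   // split the range across worker threads
                .mapToLong(x -> (long) x * x) // independent per-element work
                .toArray();
    }

    public static void main(String[] args) {
        long[] out = squareAll(new int[]{1, 2, 3, 4});
        System.out.println(java.util.Arrays.toString(out)); // [1, 4, 9, 16]
    }
}
```

No such split is possible for the Fibonacci recurrence, since each term needs the two before it.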
4. Google Data Center
Google's philosophy is to buy cheap computers, but in large numbers.
Google applies this parallel processing concept in its data centers.
Map Reduce is a parallel and distributed approach developed by Google for processing large data sets.
5. Map Reduce Introduction
Map Reduce has two key components: Map and Reduce.
The Map function is applied to the input values to compute a set of key/value pairs.
The Reduce function aggregates this data into a scalar.
6. Data Distribution
Input files are split into M pieces on the distributed file system.
Intermediate files created by map tasks are written to local disks.
Output files are written to the distributed file system.
7. Data Distribution
8. Map Reduce Function
To see the Map Reduce functions by example, consider the query: SELECT SUM(stuMarks) FROM student GROUP BY studentSection.
In this query, the SELECT phase does the same job as Map, and the GROUP BY aggregation does the same job as the Reduce phase.
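The SQL analogy can be sketched in plain Java without Hadoop (the Student record and its field names are invented for illustration): the map step emits (section, marks) pairs, and the reduce step sums the marks within each section.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the SQL analogy: map emits (section, marks)
// pairs; reduce sums the values that share a key, mirroring
// SELECT SUM(stuMarks) FROM student GROUP BY studentSection.
public class MarksBySection {
    record Student(String section, int marks) {}

    static Map<String, Integer> sumMarksBySection(List<Student> students) {
        return students.stream()
                // groupingBy plays the role of GROUP BY (the key),
                // summingInt plays the role of SUM (the reduce).
                .collect(Collectors.groupingBy(Student::section,
                        Collectors.summingInt(Student::marks)));
    }

    public static void main(String[] args) {
        List<Student> students = List.of(
                new Student("A", 70), new Student("A", 80), new Student("B", 60));
        System.out.println(sumMarksBySection(students).get("A")); // 150
    }
}
```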
9. Classical Example
The classic example of Map Reduce is log file analysis.
Big log files are split, and the mappers search for the different web pages that were accessed.
Every time a web page is found in the log, a key/value pair is emitted to the reducer, with key = web page and value = 1.
The reducer aggregates the counts for each web page.
The result is the total hit count for each web page.
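The log-analysis pattern above can be sketched in plain Java (a minimal single-machine sketch, not a distributed implementation; the log-line format is an assumption):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the log-analysis example: the "map" step emits
// (page, 1) for every log line, and the "reduce" step sums the 1s per page.
public class PageHits {
    static Map<String, Integer> countHits(List<String> logLines) {
        Map<String, Integer> hits = new HashMap<>();
        for (String line : logLines) {
            // Assumption: the first whitespace-separated field of each
            // log line is the web page that was accessed.
            String page = line.split(" ")[0];
            hits.merge(page, 1, Integer::sum); // reduce: aggregate the emitted 1s
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> log = List.of("/index GET", "/about GET", "/index GET");
        System.out.println(countHits(log).get("/index")); // 2
    }
}
```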
10. Reverse Web Link Graph
In this example, the Map function outputs a (target URL, source) pair for each link found in an input web page (the source).
The Reduce function concatenates the list of all source URLs associated with a given target URL and returns (target, list(source)).
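The reverse-link idea can be sketched in plain Java (a single-machine sketch with invented names; the real computation runs over crawled pages on a cluster):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the reverse web-link graph: map emits a
// (target, source) pair per link; reduce gathers all sources per target.
public class ReverseLinks {
    record Link(String source, String target) {}

    static Map<String, List<String>> reverse(List<Link> links) {
        Map<String, List<String>> sourcesByTarget = new HashMap<>();
        for (Link link : links) {
            // map phase: key = target URL, value = source URL
            sourcesByTarget
                    .computeIfAbsent(link.target(), t -> new ArrayList<>())
                    .add(link.source()); // reduce phase: concatenate the sources
        }
        return sourcesByTarget; // (target, list(source)) pairs
    }

    public static void main(String[] args) {
        List<Link> links = List.of(
                new Link("a.com", "c.com"), new Link("b.com", "c.com"));
        System.out.println(reverse(links).get("c.com")); // [a.com, b.com]
    }
}
```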
11. Other Examples
Map Reduce can be used for a lot of problems.
For example, Google has used Map Reduce in the calculation of page ranks.
Word count over a large set of documents can also be solved very efficiently with Map Reduce.
Google's Map Reduce library is not open source, but a Java implementation called Hadoop is open source.
12. Implementation of Example
Word Count is a simple application that counts the number of occurrences of words in a given set of inputs.
The Hadoop library is used for its implementation.
The code is given in the attached file below.
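Since the attached file is not reproduced in this transcript, here is a hedged, Hadoop-free sketch of the same Word Count logic in plain Java (the class name is my own; the real implementation uses Hadoop's Mapper and Reducer classes):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Single-machine sketch of Word Count: the "map" step tokenizes each
// line and emits (word, 1); the "reduce" step sums the counts per word.
public class WordCount {
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Split the line into whitespace-separated tokens.
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                counts.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("Hello World Bye World",
                                     "Hello Hadoop Goodbye Hadoop");
        System.out.println(count(input).get("World"));  // 2
        System.out.println(count(input).get("Hadoop")); // 2
    }
}
```

In Hadoop the same tokenize-and-emit logic lives in the map method, and the summation lives in the reduce method, as the walkthrough on the following slides describes.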
13. Usage of Implementation
For example, the input files are:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Run the application.
Word Count is a straightforward problem.
14. Walk Through Implementation
The Mapper implementation (lines 14-26), via the map method (lines 18-25), processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <word, 1>.
For the given sample input the first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
15. Walk Through Implementation
WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which is the same as the Reducer, as per the job configuration) for local aggregation, after being sorted on the keys.
The output of the first map:
<Bye, 1> <Hello, 1> <World, 2>
The output of the second map:
<Goodbye, 1> <Hadoop, 2> <Hello, 1>
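The combiner's effect can be sketched in plain Java (hypothetical names, not the Hadoop API): running the same summing logic over a single map task's output collapses duplicate keys locally, before anything crosses the network.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a combiner: the summing reduce logic is applied
// locally to one map task's output, so <World, 1> <World, 1> becomes
// <World, 2> before the data is sent to the reducer.
public class LocalCombiner {
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>(); // sorted on the keys
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Output of the first map: one (word, 1) pair per token.
        List<String> firstMap = List.of("Hello", "World", "Bye", "World");
        System.out.println(combine(firstMap)); // {Bye=1, Hello=1, World=2}
    }
}
```

This reproduces the first map's combined output shown above, while shipping three pairs to the reducer instead of four.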
16. Walk Through Implementation
The Reducer implementation (lines 28-36), via the reduce method (lines 29-35), just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
Thus the output of the job is:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
The run method specifies various facets of the job, such as the input/output paths (passed via the command line), key/value types, input/output formats etc., in the JobConf. It then calls JobClient.runJob (line 55) to submit the job and monitor its progress.
17. Execution Overview
18. Map Reduce Execution
The Map Reduce library in the user program first splits the input files into M pieces. It then starts up many copies of the program on a cluster of machines.
One of the copies is special, the Master; the others are workers. There are M Map tasks and R Reduce tasks to assign. The master picks idle workers and assigns each one a Map task or a Reduce task.
A worker assigned a Map task reads the contents of the corresponding input split. It parses the key/value pairs and passes them to the user-defined Map function, which generates intermediate key/value pairs buffered in memory.
Periodically, the buffered pairs are written to the local disks. The locations of these buffered pairs on the local disks are passed back to the master, who is responsible for forwarding them to the reduce workers.
19. Map Reduce Execution
When the master notifies a reduce worker about these locations, the worker uses RPC to read the buffered data from the map workers' local disks, and then sorts the data by the intermediate keys.
The reduce worker iterates over the sorted intermediate data; for each unique key, it passes the key and the corresponding values to the Reduce function. The output is appended to the final output file.
Many associated issues are handled by the library, such as:
Parallelization
Fault Tolerance
Data Distribution
Load Balancing
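One detail the library handles when routing intermediate pairs from M map tasks to R reduce tasks is partitioning; a common default is hash(key) mod R, sketched here with invented names:

```java
// Sketch of a default partitioning scheme for routing intermediate keys
// to one of R reduce tasks: hash(key) mod R. All pairs with the same key
// land on the same reduce worker, so each key is reduced exactly once.
public class Partitioner {
    static int partition(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is a valid index in [0, R).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int r = 4;
        // The same key always maps to the same partition.
        System.out.println(partition("Hello", r) == partition("Hello", r)); // true
    }
}
```

This is why load balancing matters: a skewed key distribution can overload one reduce worker while others sit idle.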
20. Debugging
The library offers human-readable status information on an HTTP server; the user can see jobs in progress, completed, etc.
It also allows the use of GDB and other debugging tools.
21. Conclusions
Map Reduce simplifies large-scale computations that fit this model.
It allows the user to focus on the problem without worrying about the distribution details.
It is being used by renowned companies like Google and Yahoo.
Google's library for Map Reduce is not open source, but an Apache project called Hadoop is an open-source library for Map Reduce.