Hadoop Ida Mele

Upload: allyson-ashlyn-curtis

Post on 11-Jan-2016


Page 1: Hadoop Ida Mele. Parallel programming Parallel programming is used to improve performance and efficiency In a parallel program, the processing is broken

Hadoop

Ida Mele


Parallel programming

• Parallel programming is used to improve performance and efficiency

• In a parallel program, the processing is broken up into parts that run concurrently on different machines

• Supercomputers can be replaced by large clusters of CPUs. These CPUs may be on the same machine or in a network of computers

Ida Mele Hadoop 2


Parallel programming

[Figure: Super Computer (web graphic, Janet E. Ward, 2000) and Cluster of Desktops]

Parallel programming

• We have to identify the sets of tasks that can run concurrently

• Parallel programming is suitable for large datasets where we can split the data into equal-size portions

• The number of tasks we can perform concurrently depends on the size of the original dataset and on how many CPUs (nodes) we have


Parallel programming

• One node is the master: it divides the problem into sub-problems and gives them to the other nodes, also called workers

• Each worker node processes its sub-problem and returns the partial result to the master

• Once the master receives partial results from all the workers, it can combine them to compute the final result
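The master/worker scheme can be sketched on a single machine as follows. This is only an illustration under stated assumptions: threads stand in for worker nodes, the problem is the sum of a list, and the names `split`, `worker`, and `master` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Master step: divide the problem into at most n_parts sub-problems."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(chunk):
    """Worker step: solve one sub-problem and return the partial result."""
    return sum(chunk)


def master(data, n_workers=4):
    """Split the data, hand the chunks to the workers, combine the results."""
    chunks = split(data, n_workers)
    # Threads stand in for worker nodes in this single-machine sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker, chunks))
    # The master combines the partial results into the final result.
    return sum(partial_results)


total = master(list(range(1, 101)))
print(total)
```

On a real cluster the chunks would be sent over the network and the workers would be separate machines; the split, process, and combine structure stays the same.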


Parallel programming: examples

• Assuming a big array, it can be split into small sub-arrays

• Assuming a large document collection, it can be split into documents, and again each document can be broken up into paragraphs, lines, …

• Note: not all problems can be parallelized. For example, a computation in which each value depends on the previous one cannot be split into independent tasks
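The difference between the two cases can be illustrated with a small sketch. Both functions are hypothetical examples chosen for this note: squaring is independent per element, while the recurrence needs the previous value at every step.

```python
def square_all(values):
    # Each output depends only on its own input, so the list can be
    # split into sub-arrays and processed by different workers.
    return [v * v for v in values]


def iterate(x0, steps):
    # Each value depends on the previous one, so step i cannot start
    # before step i - 1 has finished.
    xs = [x0]
    for _ in range(steps):
        xs.append((3 * xs[-1] + 1) % 7)
    return xs


data = [1, 2, 3, 4]
# Processing two sub-arrays separately gives the same result as one pass.
assert square_all(data) == square_all(data[:2]) + square_all(data[2:])
print(iterate(1, 3))
```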


MapReduce

• MapReduce was developed by Google for processing large amounts of raw data

• It is used for distributed computing on clusters of computers

• MapReduce is an abstraction which allows programmers to implement distributed programs

• Distributed programming is complex, so MapReduce hides the issues related to parallelization, data distribution, load balancing, and fault tolerance


MapReduce

• MapReduce is inspired by the map and reduce combinators of Lisp

• Map: (key1, val1) → [(key2, val2)]. The map function takes as input <key,value> pairs and produces a set of zero or more intermediate <key,value> pairs

• The framework groups together all the intermediate values associated with the same intermediate key and passes them to the reducer

• Reduce: (key2, [val2]) → [val3]. The reduce function aggregates the values of a key by using a binary operation, such as the sum
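The two signatures can be mimicked in plain Python. This is a minimal sketch, not Hadoop code: `map_fn`, `group_by_key`, and `reduce_fn` are hypothetical names, and the toy job sums the even and the odd numbers found in the input lines.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def map_fn(key, value):
    """(key1, val1) -> list of (key2, val2): label each number in the line."""
    return [("even" if int(n) % 2 == 0 else "odd", int(n)) for n in value.split()]


def group_by_key(pairs):
    """Framework step: collect all values that share an intermediate key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """(key2, [val2]) -> [val3]: fold the values with a binary operation."""
    return [reduce(add, values)]


records = [(0, "1 2 3"), (1, "4 5")]
intermediate = [pair for k, v in records for pair in map_fn(k, v)]
result = {k: reduce_fn(k, vs) for k, vs in group_by_key(intermediate).items()}
print(result)
```

Swapping `add` for another binary operation (for example `max`) changes the aggregation without touching the map or the grouping step.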


MapReduce: dataflow

• Input reader: it reads the data from the stable storage and divides it into portions (splits or shards), then it assigns one split to each map function

• Map function: it takes a series of <key, value> pairs and processes each of them to create zero or more intermediate <key, value> pairs

• Partition function: between the map and the reduce stages, data is shuffled: parallel-sorted and exchanged among nodes. Data is moved from the node of the map to the shard in which it will be reduced. To do that, the partition function receives the key and the number of reducers, and it returns the index of the desired reducer. Distributing the keys evenly among the reducers gives load balancing
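A common way to write a partition function is to hash the key modulo the number of reducers. The sketch below assumes string keys and uses `zlib.crc32` as a stable hash; `partition` and `shuffle` are hypothetical names, not part of any framework.

```python
import zlib


def partition(key, num_reducers):
    """Return the index of the reducer that should receive this key."""
    # crc32 is used instead of the built-in hash() because hash() of a
    # string changes between Python runs; any stable hash would do.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route every intermediate pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


pairs = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]
buckets = shuffle(pairs, num_reducers=3)
print(buckets)
```

Because the index depends only on the key, all the pairs with the same key reach the same reducer, whichever map task produced them.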


MapReduce: dataflow

• Comparison function: the input of each reducer is sorted using the comparison function

• Reduce function: it takes the sorted intermediate pairs and aggregates the values by key, in order to produce a single output for each key

• Output writer: it writes the output of the reducer to stable storage, which is usually the distributed file system
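The role of the comparison function on the reducer side can be sketched as follows. The comparator and the helper `run_reducer` are assumptions made for illustration; once the pairs are sorted, equal keys are adjacent and each run of equal keys yields one output.

```python
from functools import cmp_to_key
from itertools import groupby


def compare_keys(a, b):
    """Comparison function: negative, zero, or positive result."""
    return (a > b) - (a < b)


def run_reducer(pairs, reduce_fn):
    """Sort one reducer's input by key, then reduce each run of equal keys."""
    key_of = cmp_to_key(compare_keys)
    ordered = sorted(pairs, key=lambda kv: key_of(kv[0]))
    output = []
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        output.append((key, reduce_fn(key, [value for _, value in group])))
    return output


reducer_input = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("a", 1)]
reduced = run_reducer(reducer_input, lambda key, values: sum(values))
print(reduced)
```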


Example

• Consider the following problem: given a large collection of documents, we want to compute the number of occurrences of each term in the documents

• How can we solve it in parallel?


Example

• Assume we have a set of workers:

1. Divide the collection of documents among them, for example one document for each worker

2. Each worker returns the count of a given word in a document

3. Sum up the counts from all the documents to have the overall number of occurrences of a word in the collection of documents


Example

    map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
            EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
            result += ParseInt(v);
        Emit(AsString(result));


Pseudo-code
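The pseudo-code above can be turned into a runnable Python sketch. `map_fn` and `reduce_fn` mirror the two functions, with `yield` playing the role of `EmitIntermediate` and `Emit`; the driver `word_count` is a hypothetical stand-in for the framework, which in a real system performs the grouping across machines.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name, input_value: document contents
    for word in input_value.split():
        yield word, "1"


def reduce_fn(output_key, intermediate_values):
    # output_key: a word, intermediate_values: an iterator over counts
    result = 0
    for v in intermediate_values:
        result += int(v)
    yield str(result)


def word_count(documents):
    """Tiny driver that plays the role of the framework (an assumption)."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, one in map_fn(name, contents):
            grouped[word].append(one)
    return {w: next(reduce_fn(w, iter(vs))) for w, vs in grouped.items()}


counts = word_count({"doc1": "to be or not to be", "doc2": "to do"})
print(counts)
```

As in the pseudo-code, counts travel as strings and are parsed in the reducer.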


Execution overview

1. The map invocations are distributed across multiple CPUs, by automatically partitioning the input data into M splits (or shards). The shards can be processed on different CPUs concurrently

2. One copy of the program is the master, and it assigns the work to the other copies (workers). In particular, it has M map tasks and R reduce tasks to assign, so the master picks idle workers and assigns each one a map or reduce task

3. A worker that is assigned a map task reads the content of the corresponding input shard. It applies the user-defined map function, in parallel with the other map workers, and produces the intermediate <key,value> pairs


Execution overview

4. The intermediate pairs are periodically written to the local disk, partitioned into R regions by the partitioning function. The locations of these pairs are passed to the master, which forwards them to the reduce workers

5. The reduce worker reads the intermediate pairs and sorts them by the intermediate key

6. The reduce worker iterates over the sorted pairs and applies the reduce function, which aggregates the values for each key. Then, the output of the reducer is appended to the output file
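The six steps can be simulated in a single process to see how the M splits and the R regions fit together. This is a sketch under assumptions: everything runs sequentially in memory, the master's task assignment (step 2) is not modeled, keys are strings, and `run_job` is a hypothetical name.

```python
import zlib
from collections import defaultdict


def run_job(records, map_fn, reduce_fn, M=3, R=2):
    # Step 1: partition the input into M splits (round-robin here).
    splits = [records[i::M] for i in range(M)]

    # Steps 3-4: each map task writes its pairs into R regions, chosen by
    # the partitioning function (a stable hash of the key modulo R).
    regions = [[[] for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for key, value in split:
            for ikey, ivalue in map_fn(key, value):
                r = zlib.crc32(ikey.encode("utf-8")) % R
                regions[m][r].append((ikey, ivalue))

    # Steps 5-6: each reduce task reads its region from every map task,
    # sorts the pairs by key, and applies the reduce function per key.
    outputs = []
    for r in range(R):
        pairs = sorted(pair for m in range(M) for pair in regions[m][r])
        grouped = defaultdict(list)
        for ikey, ivalue in pairs:
            grouped[ikey].append(ivalue)
        outputs.append([(k, reduce_fn(k, vs)) for k, vs in grouped.items()])
    return outputs


def word_map(name, text):
    return [(word, 1) for word in text.split()]


def word_reduce(word, counts):
    return sum(counts)


docs = [("d1", "to be or not to be"), ("d2", "to do"), ("d3", "be")]
parts = run_job(docs, word_map, word_reduce)
counts = dict(pair for part in parts for pair in part)
print(counts)
```

Each element of `parts` corresponds to the output of one reduce task, so the job produces R output lists.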


Reliability

• MapReduce provides high reliability: to detect failures, the master pings every worker periodically. If a worker is silent for longer than a given time interval, the master marks the worker as failed and reassigns the failed worker’s work to another node

• When a failure occurs, the map tasks completed by the failed worker have to be re-executed, since their output is stored on the local disk of the failed machine and is inaccessible. Completed reduce tasks do not need to be re-executed, because their output is stored in the global file system

• Some operations are atomic to ensure no conflicts
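The failure-handling rules can be sketched as two small functions. The data layout is an assumption made for this note: the master is imagined to keep the time of each worker's last ping reply and a list of task records with the hypothetical fields `id`, `kind`, `state`, and `worker`.

```python
def find_failed_workers(last_reply, now, timeout):
    """Workers whose last ping reply is older than `timeout` seconds."""
    return {w for w, t in last_reply.items() if now - t > timeout}


def tasks_to_reschedule(tasks, failed):
    """Select the ids of the tasks that must run again on another node."""
    redo = []
    for task in tasks:
        if task["worker"] not in failed:
            continue
        if task["state"] == "in_progress":
            # Unfinished work on a failed worker is lost.
            redo.append(task["id"])
        elif task["state"] == "completed" and task["kind"] == "map":
            # Map output lived on the local disk of the failed machine.
            redo.append(task["id"])
        # Completed reduce tasks are skipped: their output is already
        # stored in the global file system.
    return redo


last_reply = {"w1": 100.0, "w2": 92.0}
failed = find_failed_workers(last_reply, now=105.0, timeout=10.0)
tasks = [
    {"id": "map-0", "kind": "map", "state": "completed", "worker": "w2"},
    {"id": "reduce-0", "kind": "reduce", "state": "completed", "worker": "w2"},
    {"id": "reduce-1", "kind": "reduce", "state": "in_progress", "worker": "w2"},
    {"id": "map-1", "kind": "map", "state": "completed", "worker": "w1"},
]
print(tasks_to_reschedule(tasks, failed))
```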


Hadoop

• Apache Hadoop is an open-source framework for reliable, scalable, and distributed computing. It implements the computational paradigm named MapReduce.

• Useful links:
  • http://hadoop.apache.org/
  • http://hadoop.apache.org/docs/stable/
  • http://hadoop.apache.org/releases.html#Download


Hadoop

• The project includes several modules:
  • Hadoop Common: the common utilities that support the other Hadoop modules
  • Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
  • Hadoop YARN: a framework for job scheduling and cluster resource management
  • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets


Install Hadoop

• Download release 1.1.1 of Hadoop, available at: http://www.dis.uniroma1.it/~mele/WebIR.html or one of the latest releases of Hadoop, available at: http://hadoop.apache.org/releases.html#Download

• The directory conf contains all the configuration files

• Set JAVA_HOME by editing the file conf/hadoop-env.sh (replace %JavaPath with the path of your Java installation):

    export JAVA_HOME=%JavaPath


Install Hadoop

• Optional: (if needed) we can specify the classpath

• Optional: we can specify the maximum amount of heap to use. The default is 1000 MB, but we can increase it by editing the file conf/hadoop-env.sh:

    export HADOOP_HEAPSIZE=2000


Install Hadoop

• Optional: we can specify the directory for the temporary output

• Edit the file conf/core-site.xml, adding the following lines:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-tmp-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>


Example: WordCounter

• Download WordCounter.jar and text.txt, available at: http://www.dis.uniroma1.it/~mele/WebIR.html

• Put WordCounter.jar in the Hadoop directory

• In the Hadoop directory, create the sub-directory einput and copy the input file text.txt into it

• Run the word counter by issuing the following command:

    bin/hadoop jar WordCounter.jar mapred.WordCount einput/ eoutput/

• Note: make sure that the output directory doesn't already exist


Example: WordCounter

• The Map class:


Example: WordCounter

• The Reduce class:


Example: WordCounter

    more eoutput/part-00000

    words        occurrences
    '1500,'      1
    'Come,       1
    'Go          1
    'Hareton     1
    'Here        1
    'I           1
    'If          1
    'Joseph!'    1
    'Mr.         2
    'No          1
    'No,         1
    'Not         1
    'She's       1


Example: WordCounter

    sort -k2 -n -r eoutput/part-00000 | more

Most frequent words (word frequencies sorted in decreasing order):

    the     93
    of      73
    and     64
    a       60
    I       57
    to      47
    my      27
    in      23
    his     19
    with    18
    have    16
    that    15
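The same ranking can be computed in Python instead of with sort. The sketch assumes that each line of the output file holds a word and its count separated by a tab; `top_words` is a hypothetical helper, and the sample lines reuse three counts from the listing above.

```python
def top_words(lines, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.rsplit("\t", 1)  # assumption: tab-separated output
        pairs.append((word, int(count)))
    # sorted() is stable, so words with equal counts keep their file order.
    return sorted(pairs, key=lambda wc: wc[1], reverse=True)[:n]


sample = ["of\t73\n", "the\t93\n", "and\t64\n"]
top = top_words(sample, n=2)
print(top)
```

To rank a real run, the file can be passed directly, for example `top_words(open("eoutput/part-00000"))`.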