apache hadoop: dfs and map reduce

36
Apache Hadoop DFS and Map Reduce Víctor Sánchez Anguix Universitat Politècnica de València MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image Course 2014/2015

Upload: victor-sanchez-anguix

Post on 16-Jul-2015

207 views

Category:

Data & Analytics


2 download

TRANSCRIPT

Page 1: Apache Hadoop: DFS and Map Reduce

Apache HadoopDFS and Map Reduce

Víctor Sánchez AnguixUniversitat Politècnica de València

MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image

Course 2014/2015

Page 2: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Who has not heard about Hadoop?

Page 3: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Page 4: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Who knows exactly what is Hadoop?

Page 5: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Being simplistic:

What is Apache Hadoop?

DFS MapReduce

Page 6: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Google publishes paper about GFS (2003). http://research.google.com/archive/gfs.html

➢ Distributed data among cluster of computers

➢ Fault tolerant

➢ Highly scalable with commodity hardware

A bit of history: Distributed File System (DFS)

Page 7: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Google publishes paper about MR (2004). http://research.google.com/archive/mapreduce.html

➢ Algorithm for processing distributed data in parallel

➢ Simple in concept, extremely useful in practice

A bit of history: Map Reduce (MR)

Page 8: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Doug Cutting and Mike Caffarella → Apache Nutch

➢ Doug Cutting goes to Yahoo

➢ Yahoo implements Apache Hadoop

A bit of history: Hadoop is born

Page 9: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Framework for distributed computing

➢ Still based on DFS and MR

➢ It is the main actor in Big Data

➢ Last major release: Apache Hadoop 2.6.0 (Nov 2014)http://hadoop.apache.org/

Apache Hadoop now

Page 10: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

DFS architecture

Page 11: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Interacting with Hadoop DFS: creating dirs

➢ Examples:

hdfs dfs -mkdir data

hdfs dfs -mkdir results

Page 12: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Interacting with Hadoop DFS: uploading files

➢ Examples:

hdfs dfs -put datasets/students.tsv data/students.tsv

hdfs dfs -put datasets/grades.tsv data/grades.tsv

Page 13: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Interacting with Hadoop DFS: listing

➢ Examples:

hdfs dfs -ls data

Found 2 items

-rw-r--r-- 3 sanguix supergroup 450 2015-02-09 10:50 data/grades.tsv

-rw-r--r-- 3 sanguix supergroup 194 2015-02-09 10:45 data/students.tsv

Page 14: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Interacting with Hadoop DFS: get a file

➢ Examples:

hdfs dfs -get data/students.tsv

hdfs dfs -get data/grades.tsv

Page 15: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Interacting with Hadoop DFS: deleting files

➢ Examples:

hdfs dfs -rm data/students.tsv

hdfs dfs -rm data/grades.tsv

Page 16: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Interacting with Hadoop DFS: space use info

➢ Examples:

hdfs dfs -df -h

Filesystem Size Used Available Use%

hdfs://localhost 1.5 T 12 K 491.6 G 0%

Page 17: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce: Overview

Input data

Input data

Input data

Map task

Map task

Map task

Reduce task

Reduce task

Reduce task

Output data

Output data

Output data

chunk of data (key,value) value’

chunk of data (key,value) value’

Page 18: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map: Transform data to (key, value)

Input data

Input data

Input data

Map task

Map task

Map task

chunk of data

chunk of data

Page 19: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Shuffle: Send (key, values)

Reduce task

Reduce task

Reduce task

(key,value)

(key,value)

Map task

Map task

Map task

Page 20: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Reduce: Aggregating (key,values)

Reduce task

Reduce task

Reduce task

Output data

Output data

Output data

value’

value’

Page 21: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce

Input data

Input data

Input data

Map task

Map task

Map task

Reduce task

Reduce task

Reduce task

Output data

Output data

Output data

chunk of data (key,value) value’

chunk of data (key,value) value’

Page 22: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce example: word count

CHUNK 1this class is about big data and artificial intelligence

CHUNK 2there is nothing big about this example

CHUNK 3I am a big artificial intelligence enthusiast

➢ The file is divided in chunks to be processed in parallel

➢ Data is sent untransformed to map nodes

Page 23: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce example: word count

this class is about big data and artificial intelligence

[this, class, is, about, big, data, and, artificial, intelligence]

Tokenize

(this,1), (class,1), (is,1), (about,1), (big,1), (class, 1), (is, 1), (about 1), (big, 1), (data, 1), (and, 1), (artificial,1), (intelligence, 1)

Prepare (key,value) pairs

MAP TASK

Raw chunk

Ready to shuffle

Page 24: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce example: word countMap Reduce example: word count

(big,1)(big,1)(big,1)

(big,3)Sum

REDUCE TASK

Fromshuffle Output

Page 25: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Exercise: Matrix power

row column value1 1 3.2

2 3 4.3

3 3 5.1

1 3 0.1

Page 26: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce variants: No reduce

Input data

Input data

Input data

Map task

Map task

Map task

Output data

Output data

Output data

chunk of data (key,value)

chunk of data (key,value)

Page 27: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Map Reduce variants: chaining

Input data

Input data

Input data

Map task

Map task

Map task

Reduce task

Reduce task

Reduce task

Output data

Output data

Output data

Map task

Map task

Map task

Reduce task

Reduce task

Reduce task

Output data

Output data

Output data

Page 28: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Maps are executed in parallel➢ Reducers do not start until all maps are

finished➢ Output is not finished until all reducers are

finished➢ Bottleneck: Unbalanced map/reduce taks

○ Change key distribution

○ Increase reduces for increasing parallelism

Map Reduce: bottlenecks

Page 29: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Hadoop is implemented in Java

➢ It is possible to program jobs formed by maps and reduces in Java

➢ We won’t go deep in these matters (bear with me!)

Map Reduce in Hadoop

Page 30: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image http://hadoop.apache.org/

Hadoop architecture

Page 31: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

public class WordCount {

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1); private Text word = new Text();

public void map(Object key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } }

Map Reduce job in Hadoop

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }

...

Page 32: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); }}

Map Reduce job in Hadoop

Page 33: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Compilingjavac -cp opt/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar:opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d WordCount source/hadoop/WordCount.java

jar -cvf WordCount.jar -C WordCount/ .

➢ Submitting

hadoop jar WordCount.jar es.upv.dsic.iarfid.haia.WordCount

/user/your_username/data/students.tsv /user/your_username/wc

Compiling and submitting a MR job

Page 34: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Hadoop ecosystem

Page 35: Apache Hadoop: DFS and Map Reduce

Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ http://hadoop.apache.org

➢ Hadoop in Practice. Alex Holmes. Ed. Manning Publications

➢ Hadoop: The Definitive Guide. Tom White. Ed. O’Reilly.

➢ StackOverflow

Extra information

Page 36: Apache Hadoop: DFS and Map Reduce

Apache HadoopDFS and Map Reduce

Víctor Sánchez AnguixUniversitat Politècnica de València

MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image

Course 2014/2015