
Advanced topics on Mapreduce with Hadoop

Jiaheng Lu

Department of Computer Science

Renmin University of China

www.jiahenglu.net

Outline

- Brief Review
- Chaining MapReduce Jobs
- Join in MapReduce
- Bloom Filter

Brief Review

A parallel programming framework: divide the input into splits, process them in parallel, and merge the results.

[Figure: MapReduce data flow. Input data is divided into splits (split0, split1, split2); each split is processed by a map task (the mappers); the intermediate results are shuffled to the reduce tasks (the reducers), which write the output files (output0, output1).]

Chaining MapReduce jobs

- Chaining in a sequence
- Chaining with complex dependency
- Chaining preprocessing and postprocessing steps

Chaining in a sequence

- Simple and straightforward: a sequence of jobs, [MAP | REDUCE]+; a single job itself has the shape MAP+ | REDUCE | MAP*
- The output of one job is the input to the next, similar to Unix pipes

Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

// Add the first mapper in the chain. The four class arguments are Map1's
// input key/value and output key/value types; "true" passes key/value
// pairs by value between the chained mappers.
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class,
    Text.class, Text.class,
    true, map1Conf);

Chaining with complex dependency

Jobs are not always chained in a linear fashion.

Use the addDependingJob() method to add dependency information:

x.addDependingJob(y)

means job x will not start until job y has completed.
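In Hadoop, JobControl tracks these dependencies and launches each job once all of its dependencies have completed. The class below is only a toy model of those semantics, not Hadoop's API; all names in it are illustrative:

```java
import java.util.*;

// Toy model of JobControl-style scheduling: a job becomes runnable only
// when every job it depends on has completed.
public class DependencyDemo {
    public static class ToyJob {
        public final String name;
        public final List<ToyJob> deps = new ArrayList<>();
        public ToyJob(String name) { this.name = name; }
        // x.addDependingJob(y): x waits for y
        public void addDependingJob(ToyJob other) { deps.add(other); }
    }

    // "Run" jobs in dependency order; returns the execution order.
    public static List<String> run(List<ToyJob> jobs) {
        List<String> order = new ArrayList<>();
        Set<ToyJob> done = new HashSet<>();
        while (done.size() < jobs.size()) {
            for (ToyJob j : jobs) {
                if (!done.contains(j) && done.containsAll(j.deps)) {
                    order.add(j.name);   // job runs here
                    done.add(j);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        ToyJob y = new ToyJob("y"), x = new ToyJob("x");
        x.addDependingJob(y);                          // x must wait for y
        System.out.println(run(Arrays.asList(x, y)));  // prints [y, x]
    }
}
```

Even though x appears first in the submitted list, y runs first because x declared a dependency on it.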

Chaining preprocessing and postprocessing steps

Example: removing stop words in information retrieval. Approaches:

- Run the preprocessing and postprocessing steps as separate jobs: inefficient
- Chain those steps into a single job

Use ChainMapper.addMapper() and ChainReducer.setReducer():

Map+ | Reduce | Map*
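Outside Hadoop, the Map+ | Reduce | Map* shape can be sketched as plain function composition over in-memory records. The stop-word word count below is only an illustration of that shape; the class name and stop-word list are not from the lecture's code:

```java
import java.util.*;
import java.util.stream.*;

// Toy Map+ | Reduce chain: normalize, tokenize, and drop stop words in
// three chained "map" steps, then "reduce" to per-word counts.
public class ChainDemo {
    public static Map<String, Integer> wordCount(List<String> lines) {
        Set<String> stopWords = Set.of("the", "a", "in");
        return lines.stream()
                .map(String::toLowerCase)                      // map 1: normalize
                .flatMap(l -> Arrays.stream(l.split("\\s+")))  // map 2: tokenize
                .filter(w -> !stopWords.contains(w))           // map 3: drop stop words
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum)); // reduce
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("The cat in the hat", "a cat")));
    }
}
```

Chaining the steps inside one pipeline avoids materializing an intermediate dataset between jobs, which is exactly the point of ChainMapper/ChainReducer.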

Join in MapReduce

- Reduce-side join
- Broadcast join
- Map-side filtering and reduce-side join, filtering by:
  - a given key
  - a range from the broadcast dataset
  - a Bloom filter

Reduce-side join

Map emits <key, value> pairs where the key is the join key and the value is the record, tagged with its data source.

Reduce does a full cross-product of the values from the different sources and outputs the combined results.

Example

Table x:        Table y:
a  b            a  c
1  ab           1  b
1  cd           2  d
4  ef           4  c

map(): key = join key, value tagged with its source table:
  from x: <1, (x, ab)>, <1, (x, cd)>, <4, (x, ef)>
  from y: <1, (y, b)>, <2, (y, d)>, <4, (y, c)>

shuffle(): group the tagged values by join key:
  1 -> [(x, ab), (x, cd), (y, b)]
  2 -> [(y, d)]
  4 -> [(x, ef), (y, c)]

reduce(): cross-product of the x-values and y-values for each key:

Output:
a  b   c
1  ab  b
1  cd  b
4  ef  c

(Key 2 appears only in table y, so it contributes no output.)
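The group-by-key and cross-product steps of the example can be simulated in plain Java (a sketch of the logic, not the MapReduce API):

```java
import java.util.*;

// Simulated reduce-side join: inputs are already "shuffled" into
// per-source maps from join key to values; reduce() cross-products
// the two sides for every key.
public class ReduceSideJoin {
    public static List<String> join(Map<Integer, List<String>> x,
                                    Map<Integer, List<String>> y) {
        // collect all join keys seen on either side, in sorted order
        Set<Integer> keys = new TreeSet<>(x.keySet());
        keys.addAll(y.keySet());
        List<String> out = new ArrayList<>();
        for (int k : keys) {
            // full cross-product of x-values and y-values for this key;
            // a key present on only one side yields nothing
            for (String xv : x.getOrDefault(k, List.of()))
                for (String yv : y.getOrDefault(k, List.of()))
                    out.add(k + "\t" + xv + "\t" + yv);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> x = Map.of(1, List.of("ab", "cd"), 4, List.of("ef"));
        Map<Integer, List<String>> y = Map.of(1, List.of("b"), 2, List.of("d"), 4, List.of("c"));
        // matches the output table above: (1,ab,b), (1,cd,b), (4,ef,c)
        System.out.println(join(x, y));
    }
}
```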

Broadcast join (replicated join)

- Broadcast the smaller table and do the join in map()
- Use the distributed cache: DistributedCache.addCacheFile()
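In a real job the small table is read from the distributed cache into a hash map when the mapper starts; here that hash map is simply hard-coded, and the join happens per record, as it would in map(). A sketch, not Hadoop code; the table contents are illustrative:

```java
import java.util.*;

// Simulated broadcast (replicated) join: the small table lives in a
// hash map on every "mapper"; each big-table record is joined in map()
// with an O(1) lookup, so no shuffle is needed.
public class BroadcastJoin {
    // small side: join key -> value (stand-in for the cached file)
    static final Map<Integer, String> SMALL = Map.of(1, "sales", 2, "hr");

    // map(): emit a joined record, or nothing if the key has no match
    public static Optional<String> map(int key, String bigValue) {
        String smallValue = SMALL.get(key);
        return smallValue == null
                ? Optional.empty()
                : Optional.of(key + "\t" + bigValue + "\t" + smallValue);
    }

    public static void main(String[] args) {
        System.out.println(map(1, "alice"));    // joined
        System.out.println(map(9, "mallory"));  // dropped: no match in small table
    }
}
```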

Map-side filtering and Reduce-side join

Join key: student IDs from the info dataset. Generate an IDs file from info and use it in a broadcast join.

What if the IDs file can't be stored in memory? Use a Bloom filter.

A Bloom Filter

- Introduction
- Implementation of a Bloom filter
- Use in a MapReduce join

Introduction to Bloom Filter

A space-efficient, constant-size data structure for testing set membership, with add() and contains() operations.

It produces no false negatives and only a small probability of false positives.

Implementation of bloom filter

Use a bit array of m bits and k hash functions.

Add an element:
- generate k indexes by hashing the element
- set those k bits to 1

Test an element:
- generate the same k indexes
- all k bits are 1 >> true; not all are 1 >> false

Example (a 10-bit array, indexes 0-9):

① initial state:        0 0 0 0 0 0 0 0 0 0
② add x (0, 2, 6):      1 0 1 0 0 0 1 0 0 0
③ add y (0, 3, 9):      1 0 1 1 0 0 1 0 0 1
④ contain m (1, 3, 9)?  bit 1 is 0 >> false (×)
⑤ contain n (0, 2, 9)?  bits 0, 2, 9 are all 1 >> true (√), but n was never added: a false positive
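A minimal bit-array implementation of the add/test steps above. Deriving the k indexes by double hashing, as done here, is one common choice, not the only one:

```java
import java.util.BitSet;

// Minimal Bloom filter: an m-bit array, k bit indexes per element
// derived by double hashing. No false negatives; false positives
// possible with small probability.
public class BloomFilter {
    private final BitSet bits;
    private final int m, k;

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // i-th index for an element: h1 + i*h2 (mod m), h2 forced odd
    private int index(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(Object element) {
        for (int i = 0; i < k; i++)
            bits.set(index(element, i));        // set the k bits to 1
    }

    public boolean contains(Object element) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(element, i)))
                return false;                   // a 0 bit: definitely absent
        return true;                            // all bits 1: probably present
    }
}
```

With m sized generously relative to the number of added elements, the false-positive rate stays small; elements that were added are always reported present.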

Use in MapReduce join

- Run a separate subjob to build a Bloom filter over the join keys of the smaller dataset
- Broadcast the Bloom filter and test each record in map() of the join job
- Drop the records whose keys are not in the filter, and do the join in reduce()
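Putting the pieces together as an in-memory simulation (not the Hadoop API): build a filter over the small side's join keys, use it in "map()" to drop big-side records early, and join the survivors exactly in "reduce()". The tiny inline filter and all data are illustrative:

```java
import java.util.*;

// Simulated Bloom-filter join: the filter lets map() discard most
// big-side records that cannot join; reduce() does the exact join.
public class BloomJoinDemo {
    static final int M = 256, K = 3;          // filter size and hash count

    static int index(int key, int i) {        // i-th bit index for a key
        int h2 = (key >>> 16) | 1;
        return Math.floorMod(key * 31 + i * h2, M);
    }

    // "subjob": build the filter from the small side's join keys
    static BitSet buildFilter(Set<Integer> smallKeys) {
        BitSet bits = new BitSet(M);
        for (int key : smallKeys)
            for (int i = 0; i < K; i++) bits.set(index(key, i));
        return bits;
    }

    static boolean mightContain(BitSet bits, int key) {
        for (int i = 0; i < K; i++)
            if (!bits.get(index(key, i))) return false;
        return true;
    }

    public static List<String> join(Map<Integer, String> small,
                                    Map<Integer, List<String>> big) {
        BitSet filter = buildFilter(small.keySet());
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : big.entrySet()) {
            if (!mightContain(filter, e.getKey())) continue; // dropped in map()
            String s = small.get(e.getKey());                // exact check in reduce()
            if (s != null)
                for (String b : e.getValue())
                    out.add(e.getKey() + "\t" + b + "\t" + s);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> small = Map.of(1, "b", 4, "c");
        Map<Integer, List<String>> big = new TreeMap<>(
            Map.of(1, List.of("ab", "cd"), 2, List.of("zz"), 4, List.of("ef")));
        System.out.println(join(small, big));
    }
}
```

Because the filter can return false positives, the reduce-side exact check is still required; the filter only shrinks the shuffle, it does not replace the join.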

References

Chuck Lam, "Hadoop in Action"

Jairam Chandar, "Join Algorithms using Map/Reduce"

THANK YOU
