mapreduce - read-only

MapReduce

Vitaly Shmatikov

CS 5450

BackRub (Google), 1997

slide 2

NEC Earth Simulator, 2002

slide 3

Conventional HPC Machine

uCompute nodes• High-end processors• Lots of RAM

uNetwork• Specialized• Very high performance

uStorage server• RAID disk array

slide 4

HPC Programming Model

uPrograms described at a very low level• Detailed control of

processing and communication

uRely on small number of software packages• Written by specialists• Limits classes of

problems and solution methods

slide 5

Typical HPC Operation

uCharacteristics• Long-lived processes• Make use of spatial locality• Hold all data in memory (no disk access)• High-bandwidth communication

uStrengths• High utilitization of resources• Effective for many scientific applications

uWeaknesses• Requires careful tuning of applications to

resources• Intolerant of any variability

slide 6

HPC Fault Tolerance

uCheckpoint• Periodically store state of all

processes• Significant I/O traffic

uRestore after failure• Reset state to last checkpoint• All intervening computation

wasteduPerformance scaling

• Very sensitive to the number of failing components

slide 7

wasted

Google Data Center, 2016

slide 8

Dozens of these!

Ideal Cluster Programming Model

uApplications written in terms of high-level operations on the data

uRuntime system controls scheduling, load balancing

slide 9

MapReduce, 2004

slide 10

MapReduce Programming Model

uMap computation across many data objects• For example, 1010 webpages

uAggregate results in many different waysuSystem deals with resource allocation and availability

slide 11

Programming in Lisp

slide 12

Computing the sum of squares

• (map square ‘(1 2 3 4))Processes each record sequentially and

independentlyOutput: (1 4 9 16)

• (reduce + ‘(1 4 9 16))Computes (+ 16 (+ 9 (+ 4 1) ) )Output: 30

Application: Word Count

slide 13

SELECT count(word) FROM data GROUP BY word

Partial Aggregation

1. In parallel, each worker computes word counts from individual files

2. Collect results, wait until all finished3. Merge intermediate output4. Compute word count on merged intermediates

slide 14

Map

slide 15

Process individual records to generate (key, value) pairs

Welcome Everyone

Hello Everyone

Welcome 1Everyone 1 Hello 1Everyone 1

ValueKey

Parallel Map

slide 16

Process individual records to generate (key, value) pairs in parallel

Welcome Everyone

Hello Everyone

Welcome 1Everyone 1

Hello 1Everyone 1

Key Value

map task 1

map task 2

Process large number of records in parallel

Reduce

slide 17

Merge all intermediate values per key


ValueKey

Everyone 2Welcome 1 Hello 1

ValueKey

Partitioning

slide 18

Merge all intermediate values in parallel:partition keys, assign each key to one Reduce


ValueKey

Everyone 2Welcome 1 Hello 1

ValueKey

Reduce task 1

Reduce task 2

MapReduce API

slide 19

map(key, value) -> list(<k’, v’>)

• Applies function to (key, value) pair and produces set of intermediate pairs

reduce(key, list<value>) -> <k’, v’>

• Applies aggregation function to values• Outputs result

MapReduce WordCount

slide 20

map(key, value):for each word w in value:

EmitIntermediate(w, "1");

reduce(key, list(values): int result = 0;

for each v in values: result += ParseInt(v);

Emit(AsString(result));

Optimizations

slide 21

combine(list<key, value>) -> list<k,v>

• Perform partial aggregation on each mapper node<the, 1>, <the, 1>, <the, 1> à <the, 3>

• reduce() should be commutative and associative

partition(key, int) -> int• Need to aggregate intermediate values with same key• Given n partitions, map key to partition 0 ≤ i < n

– Typically via hash(key) mod n - but not always!

Putting It Together

slide 22

map combine partition reduce

Synchronization Barrier

slide 23

App: grep

u Input: large set of filesuOutput: lines that match pattern

uMap:Output a line if it matches the supplied patternuReduce: Copy the intermediate data to output

slide 24

App: Reverse Web-Link Graph

u Input: web graph, i.e., tuples (A, B) where page A links to page B

uOutput: for each page, list of pages that link to it

uMap:For each (source, target), output (target, source)uReduce: Output (target, list(source))

slide 25

App: URL Access Frequencyu Input: log of accessed URLs (e.g., from a proxy)u Output: for each URL, % of total accesses for that URL

u Map:For each URL, output (URL, 1)u Multiple reducers: Output (URL, urlCount)Chain another MapReduce...u Map:For each (URL, urlCount), output (1, (URL, urlCount))u Reduce: Compute TotalCount, output multiple (URL, urlCount/TotalCount)

slide 26

App: Sorting

u Input: series of valuesuOutput: sorted values

uMap:For each value, output (value, _)uPartitioning:Partition keys across reducers based on ranges

• Take data distribution into accounts to balance reducer tasksuReduce: Copy the intermediate data to output

slide 27

MapReduce Features

uCharacteristics• Computation broken up into

many short-lived tasks (mapping, reducing)

• Disk storage to hold intermediate results

uStrengths• Very flexible placement,

scheduling, load balancing• Can access large datasets

uWeaknesses• Higher overhead• Lower raw performance

slide 28

MapReduce Execution

slide 29

Fault Tolerance in MapReduce

uMap worker writes intermediate output to local disk, separated by partitioning; once completed, tells master node

uReduce worker told of location of mapper outputs, pulls their partition’s data, executes function across data

uNote:• “All-to-all” shuffle between mappers and

reducers• Written to disk (“materialized”) before each

stage

slide 30

Fault Tolerance in MapReduce

uMaster node monitors state of system• If master fails, job aborts

uMap worker failure• In-progress and completed tasks marked as idle• Reduce workers notified when map task is re-

executed on another map workeruReducer worker failure

• In-progress tasks are reset and re-executed• Completed tasks had been written to global file

systemslide 31

Stragglers

uStraggler = task that takes long time to execute• Bugs, flaky hardware, poor partitioning

uFor slow map tasks, execute in parallel on second “map” worker as backup, race to complete task

uWhen done with most tasks, reschedule any remaining executing tasks• Keep track of redundant executions• Significantly reduces overall run time

slide 32

Straggler Mitigation

slide 33

Goodbye MapReduce

slide 34

Modern Data Processing

slide 35

Big driver: machine learning

mapreduce - read-only

Documents